2016
DOI: 10.1155/2016/9483646

n-Gram-Based Text Compression

Abstract: We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window with a size that ranges from bigram to five grams to obtain the best encoding stream. Each n-gram is encoded by two to four bytes accordingly based o…
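The encoding phase described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract is truncated before it explains the byte layout, so the sketch emits `(n, index)` pairs instead of the two-to-four-byte codes, and it assumes a greedy longest-match-first window, whitespace tokenization, and literal fallback for words that match no dictionary entry. The names `encode`, `decode`, and `dicts` are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Gram = Tuple[str, ...]
Token = Tuple[int, Union[int, str]]


def encode(text: str, dicts: Dict[int, Dict[Gram, int]], max_n: int = 5) -> List[Token]:
    """Greedy sketch: at each position try the widest window first (five-gram
    down to bigram) and emit (n, index) for the first dictionary hit.
    Assumption: words covered by no dictionary entry are emitted as (1, word)."""
    words = text.split()
    out: List[Token] = []
    i = 0
    while i < len(words):
        for n in range(min(max_n, len(words) - i), 1, -1):
            gram = tuple(words[i:i + n])
            index = dicts.get(n, {}).get(gram)
            if index is not None:
                out.append((n, index))
                i += n
                break
        else:
            # No bigram..five-gram matched here: fall back to a literal word.
            out.append((1, words[i]))
            i += 1
    return out


def decode(tokens: List[Token], dicts: Dict[int, Dict[Gram, int]]) -> str:
    """Invert encode() by looking each index up in the reversed dictionaries."""
    reverse = {n: {idx: gram for gram, idx in d.items()} for n, d in dicts.items()}
    words: List[str] = []
    for n, value in tokens:
        if n == 1:
            words.append(str(value))
        else:
            words.extend(reverse[n][value])
    return " ".join(words)


# Toy dictionaries (hypothetical entries, not from the paper).
dicts = {
    2: {("d", "e"): 7},
    3: {("a", "b", "c"): 3},
}
tokens = encode("a b c d e", dicts)
restored = decode(tokens, dicts)
```

A real encoder would then pack each `(n, index)` pair into bytes; how the paper does that is not recoverable from the truncated abstract, so the sketch stops at the token level.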

Citations: cited by 19 publications (7 citation statements)
References: 20 publications
“…Based on the proprietary RAR compression algorithm developed by Eugene Roshal, WinRAR may be considered superior to 7-Zip due to its exceptional compression capabilities. The RAR algorithm employs a combination of Huffman coding, run-length encoding, and adaptive coding to exploit data structures and patterns effectively, resulting in consistently high compression rates [16]. The emphasis on Huffman coding, a key component of the RAR algorithm, signifies a meticulous approach to compression [17].…”
Section: Results (mentioning)
confidence: 99%
“…In the transformer knowledge domain, many terminologies are composed of two or more words (e.g., oil temperature indicator). N-gram represents a contiguous sequence of n items from a given text or speech (Koehn, 2009;Nguyen, 2016). An n-gram of one item, two items, or three items is referred to as a "unigram," a "bigram," or a "trigram," respectively.…”
Section: Key Term Extraction (mentioning)
confidence: 99%
“…'N-gram is a contiguous sequence of items from a given line of a text or speech [11]. Ngram is a method that is applied for word or character generation.…”
Section: N-gram (mentioning)
confidence: 99%