2011
DOI: 10.1093/comjnl/bxr096

Boosting Text Compression with Word-Based Statistical Encoding

Abstract: Semistatic word-based byte-oriented compressors are known to be attractive alternatives to compress natural language texts. With compression ratios around 30-35%, they allow fast direct searching of compressed text. In this article we reveal that these compressors have even more benefits. We show that most of the state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with a …

Cited by 11 publications (7 citation statements)
References 53 publications
“…. entries, up to a predetermined maximal size, say 2^18. There are several options to continue, like restarting from scratch with 9 bits, or considering the dictionary as static and not adjoining any more strings.…”
Section: LZW (mentioning)
confidence: 99%
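
The two options named in the statement above (restart the dictionary, or freeze it) can be illustrated with a minimal LZW sketch. This is an assumption-laden illustration, not the cited authors' code: the function names and the `policy` parameter are hypothetical, and the encoder emits plain integer codes instead of packing them into variable-width bit fields starting at 9 bits.

```python
def _initial_table() -> dict:
    # The 256 single-byte strings are always present.
    return {bytes([i]): i for i in range(256)}


def lzw_encode(data: bytes, max_size: int = 1 << 18, policy: str = "freeze") -> list:
    """Encode data as a list of integer LZW codes.

    policy="freeze": once the table holds max_size entries, stop adding strings.
    policy="restart": once the table is full, reset it to the 256 single bytes.
    """
    table = _initial_table()
    codes = []
    cur = b""
    for byte in data:
        nxt = cur + bytes([byte])
        if nxt in table:
            cur = nxt
            continue
        codes.append(table[cur])
        if len(table) < max_size:
            table[nxt] = len(table)
        elif policy == "restart":
            table = _initial_table()
        cur = bytes([byte])
    if cur:
        codes.append(table[cur])
    return codes


def lzw_decode(codes: list, max_size: int = 1 << 18, policy: str = "freeze") -> bytes:
    """Invert lzw_encode; must be called with the same max_size and policy."""
    table = {i: bytes([i]) for i in range(256)}
    out = bytearray()
    prev = None
    for code in codes:
        if code in table:
            entry = table[code]
        else:
            # The code refers to the string the encoder added just before emitting it.
            entry = prev + prev[:1]
        out += entry
        if prev is not None:
            # Mirror the table update the encoder made after its previous code.
            if len(table) < max_size:
                table[len(table)] = prev + entry[:1]
            elif policy == "restart":
                table = {i: bytes([i]) for i in range(256)}
        prev = entry
    return bytes(out)
```

Both sides have to agree on `max_size` and `policy`, since the decoder rebuilds the dictionary by replaying the encoder's updates. A real implementation would also track the current code width, which grows from 9 bits as the table fills.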
“…It is therefore natural to assume that once as much redundancy as possible has been removed, the remaining text should be indistinguishable from random data, for if some more regularities can be detected, they could be targeted in an additional round of compression. Indeed, Fariña et al [18] focus on byte-oriented word based compressors for natural languages, and show that such compressed files can be further compressed using any general purpose compressor such as gzip. They note that the frequencies of the byte values generated by a byte-oriented, word based, compressor, are far from uniform, as opposed to the output of arithmetic coding.…”
Section: Introduction (mentioning)
confidence: 99%
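
One way to check the observation in the statement above, that byte values in a compressed stream may be far from uniform, is to measure the zero-order entropy of the bytes. The helper below is a hypothetical sketch (the name `byte_entropy` is not from the source): a value close to 8 bits per byte suggests a near-uniform distribution, while a clearly lower value suggests skew that a second compressor could exploit.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Return the zero-order empirical entropy of data, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    # Sum -p * log2(p) over the byte values that actually occur.
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
```

This only captures the skew of individual byte frequencies; it says nothing about higher-order regularities between consecutive bytes.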
“…Before applying the BWT, we encode the text using a dense code. First we use the (s, c, b, o)-DC (SCBDC) [10] scheme, which is both prefix- and suffix-free and thus requires no verifications. We use this code on nybbles, i.e., set the parameters in a way to have s + c + b + o = 16.…”
Section: Fm-dummy (mentioning)
confidence: 99%
“…The authors in [5] exploit word based byte-oriented compression, and then transit text through character positioned compression. They use end-tagged dense code, as it is easier to build than a Huffman code.…”
Section: Related Work (mentioning)
confidence: 99%
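
The end-tagged dense code mentioned in the statement above can be sketched in a few lines, which also shows why it is simpler to build than a Huffman code: codewords depend only on a word's frequency rank, not on a tree. This is a minimal illustration under stated assumptions, not the cited authors' implementation: words are obtained by whitespace splitting, the function names are hypothetical, and the high bit of a byte marks the last byte of a codeword.

```python
from collections import Counter


def etdc_codeword(rank: int) -> bytes:
    """Codeword for the word with the given 0-based frequency rank."""
    out = [128 + rank % 128]      # last byte: high bit set
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)    # earlier bytes: high bit clear
        rank //= 128
    return bytes(reversed(out))


def etdc_compress(text: str):
    """Return (vocabulary ordered by rank, byte-oriented encoding of the words)."""
    words = text.split()
    vocab = [w for w, _ in Counter(words).most_common()]
    code = {w: etdc_codeword(i) for i, w in enumerate(vocab)}
    return vocab, b"".join(code[w] for w in words)


def etdc_decompress(vocab, payload: bytes) -> str:
    """Rebuild the word sequence; the original whitespace is normalized to single spaces."""
    words = []
    acc = 0
    for b in payload:
        if b < 128:
            acc = acc * 128 + b + 1
        else:
            words.append(vocab[acc * 128 + (b - 128)])
            acc = 0
    return " ".join(words)
```

The 128 most frequent words get one-byte codewords, the next 128 × 128 get two bytes, and so on. In the spirit of the abstract, the resulting `payload` could then be handed to a general-purpose compressor, for example `zlib.compress(payload)`; how much that helps depends on the text.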