Boosting Text Compression with Word-Based Statistical Encoding

Fariña, Antonio; Navarro, Gonzalo; Paramá, José R.

doi:10.1093/comjnl/bxr096

Cited by 11 publications

(7 citation statements)

References 53 publications

“…. entries, up to a predetermined maximal size, say 2^18. There are several options to continue, like restarting from scratch with 9 bits, or considering the dictionary as static and not adjoining any more strings.…”

Section: LZW (mentioning)

confidence: 99%
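
The two continuation options named in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function names, the `policy` switch, and the default `max_bits` value are assumptions made for the example, and codes are returned as plain integers rather than packed into 9-bit (and wider) fields.

```python
def _initial_table() -> dict[bytes, int]:
    # The 256 single-byte strings occupy codes 0..255.
    return {bytes([i]): i for i in range(256)}


def lzw_encode(data: bytes, max_bits: int = 12, policy: str = "freeze") -> list[int]:
    """LZW encoder with a capped dictionary.

    policy="freeze":  once full, the dictionary is treated as static.
    policy="restart": once full, the dictionary is reset to 256 entries.
    """
    max_size = 1 << max_bits
    table = _initial_table()
    codes: list[int] = []
    w = b""
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
            continue
        codes.append(table[w])
        if len(table) < max_size:
            table[wc] = len(table)      # adjoin the new string
        elif policy == "restart":
            table = _initial_table()    # start again from scratch
        w = bytes([byte])
    if w:
        codes.append(table[w])
    return codes


def lzw_decode(codes: list[int], max_bits: int = 12, policy: str = "freeze") -> bytes:
    """Inverse of lzw_encode; must be called with the same max_bits and policy."""
    if not codes:
        return b""
    max_size = 1 << max_bits
    table = [bytes([i]) for i in range(256)]
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if len(table) < max_size:
            # code == len(table) is the classic "string not yet known" case.
            entry = table[code] if code < len(table) else prev + prev[:1]
            table.append(prev + entry[:1])
        else:
            if policy == "restart":
                table = [bytes([i]) for i in range(256)]
            entry = table[code]
        out.append(entry)
        prev = entry
    return b"".join(out)
```

Both policies keep encoder and decoder in step because each side decides to grow, freeze, or reset the dictionary from the dictionary size alone, which the decoder can reconstruct without side information.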

“…It is therefore natural to assume that once as much redundancy as possible has been removed, the remaining text should be indistinguishable from random data, for if some more regularities can be detected, they could be targeted in an additional round of compression. Indeed, Fariña et al. [18] focus on byte-oriented, word-based compressors for natural languages, and show that such compressed files can be further compressed using any general-purpose compressor such as gzip. They note that the frequencies of the byte values generated by a byte-oriented, word-based compressor are far from uniform, as opposed to the output of arithmetic coding.…”

Section: Introduction (mentioning)

confidence: 99%

On the Randomness of Compressed Data

Klein

Shapira

2020

Information

It seems reasonable to expect from a good compression method that its output should not be further compressible, because it should behave essentially like random data. We investigate this premise for a variety of known lossless compression techniques, and find that, surprisingly, there is much variability in the randomness, depending on the chosen method. Arithmetic coding seems to produce perfectly random output, whereas that of Huffman or Ziv-Lempel coding still contains many dependencies. In particular, the output of Huffman coding has already been proven to be random under certain conditions, and we present evidence here that arithmetic coding may produce an output that is identical to that of Huffman.
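
One simple way to probe how close a compressed stream is to random data is its zero-order byte entropy: a stream whose byte values are uniformly distributed scores 8 bits per byte, and anything lower indicates the kind of skew that a second compression pass could exploit. The sketch below is only an illustration of that measurement; the function name `byte_entropy` is an assumption for the example, and zero-order entropy captures byte-frequency skew only, not the higher-order dependencies the abstract also refers to.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Zero-order empirical entropy in bits per byte (8.0 means uniform)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


# A stream using every byte value equally often versus a heavily skewed one.
uniform_stream = bytes(range(256)) * 16
skewed_stream = bytes([128] * 900 + [129] * 90 + [130] * 10)
print(byte_entropy(uniform_stream), byte_entropy(skewed_stream))
```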

“…Before applying the BWT, we encode the text using a dense code. First we use the (s, c, b, o)-DC (SCBDC) [10] scheme, which is both prefix- and suffix-free and thus requires no verifications. We use this code on nybbles, i.e., set the parameters in a way to have s + c + b + o = 16.…”

Section: FM-dummy (mentioning)

confidence: 99%

FM-index for Dummies

Grabowski

Raniszewski

Deorowicz

2017

Communications in Computer and Information Science

The FM-index is a celebrated compressed data structure for full-text pattern searching. After the first wave of interest in its theoretical developments, we can observe a surge of interest in practical FM-index variants in the last few years. These enhancements are often related to a bit-vector representation, augmented with an efficient rank-handling data structure. In this work, we propose a new, cache-friendly implementation of the rank primitive and advocate for a very simple architecture of the FM-index, which trades compression ratio for speed. Experimental results show that our variants are 2-3 times faster than the fastest known ones, for the price of using typically 1.5-5 times more space.
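
To make the role of the rank primitive concrete, here is a textbook backward-search counting routine over a BWT. It is a sketch under stated assumptions, not the cache-friendly design proposed in the paper: `rank` is a naive linear scan, the BWT is built by sorting all rotations, the text is assumed not to contain the `"\0"` sentinel, and the names `bwt` and `count_occurrences` are chosen for this example only.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations, with "\0" as sentinel."""
    s = text + "\0"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count occurrences of a non-empty pattern using backward search."""
    # C[ch] = number of symbols in the text that are strictly smaller than ch.
    counts = Counter(bwt_text)
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def rank(ch: str, i: int) -> int:
        # Occurrences of ch in bwt_text[:i]; a real index answers this in O(1).
        return bwt_text.count(ch, 0, i)

    lo, hi = 0, len(bwt_text)  # half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + rank(ch, lo)
        hi = C[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern symbol costs two rank queries, which is why the speed of the rank structure dominates the query time of practical FM-index variants.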

“…The authors in [5] exploit word based byte-oriented compression, and then transit text through character positioned compression. They use end-tagged dense code, as it is easier to build than a Huffman code.…”

Section: Related Work (mentioning)

confidence: 99%
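
The remark that an end-tagged dense code is easier to build than a Huffman code can be illustrated with a short sketch: codewords depend only on a word's frequency rank, so no code tree is needed. In the version below, the last byte of every codeword has its high bit set (values 128-255) and all earlier bytes are in 0-127. The whitespace tokenizer, the tie-breaking rule, and the function names are assumptions made for this example; a real word-based compressor would also handle separators and store the vocabulary.

```python
from collections import Counter


def etdc_codeword(rank: int) -> bytes:
    """Codeword for the word with the given 0-based frequency rank."""
    out = [128 + rank % 128]          # final byte carries the end tag
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)        # continuation bytes have the tag bit clear
        rank //= 128
    return bytes(reversed(out))


def etdc_compress(text: str) -> tuple[list[str], bytes]:
    """Return (vocabulary sorted by decreasing frequency, encoded bytes)."""
    words = text.split()
    freq = Counter(words)
    vocab = sorted(freq, key=lambda w: (-freq[w], w))
    code = {w: etdc_codeword(r) for r, w in enumerate(vocab)}
    return vocab, b"".join(code[w] for w in words)


def etdc_decompress(vocab: list[str], data: bytes) -> list[str]:
    """Decode a byte stream back into the word sequence."""
    words, acc = [], 0
    for b in data:
        if b < 128:
            acc = acc * 128 + b + 1   # continuation byte
        else:
            words.append(vocab[acc * 128 + (b - 128)])
            acc = 0
    return words
```

Because the 128 most frequent words all end up as single bytes in the range 128-255, the resulting byte stream is far from uniform, which is what leaves room for a second, general-purpose compression pass.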

A Novel Text Processing for Better Compression and Security in Cloud

Çankaya

Vinayak

2016

IJCTE

We introduce LG-encoding, a novel approach to text encoding that shuffles the position of letters in anticipation of improved compression performance. Our technique brings together the repeating letters in a word, so as to inflate the redundancy to be exploited by the compression algorithm that follows. The encoding process introduces no significant overhead: it is easily reversible, as it only involves repositioning the letters in a text. We experiment with LG-encoding on text from 4 different source languages (English, French, German, and Spanish) with a set of well-known compression algorithms that follow the encoding: Arithmetic Coding, Huffman Coding, BWT and PPM. Our results are promising, as we achieve substantially better compression rates for Arithmetic Coding and Huffman Coding when they follow LG-encoding. We also propose the use of our method in large data repositories, such as the cloud, as it also provides a significant level of security by shuffling the letters of words in text. Index Terms: text encoding, lossless text compression.
