The Smallest Grammar Problem Revisited

Bannai, Hideo; Hirayama, Momoko; Hucke, Danny; Inenaga, Shunsuke; Jeż, Artur; Lohrey, Markus; Reh, Carl Philipp

doi:10.1109/tit.2020.3038147

Cited by 6 publications

(7 citation statements)

References 39 publications


“…Other encodings were recently studied by Ganczorz [5]. Since Re-Pair is a so-called irreducible grammar, its grammar size, i.e., the sum of the symbols on the right-hand side of all rules, is upper bounded by $O(n/\log_\sigma n)$ ([3], Lemma 2), which matches the information-theoretic lower bound on the size of a grammar for a string of length $n$. Comparing this size with the size of the smallest grammar, its approximation ratio has $O((n/\lg n)^{2/3})$ as an upper bound [6] and $\Omega(\lg n/\lg\lg n)$ as a lower bound [7]. On the practical side, Yoshida and Kida [8] presented an efficient fixed-length code for compressing the Re-Pair grammar.…”

Section: Introduction (supporting)

confidence: 58%

Re-Pair in Small Space

et al. 2020


Re-Pair is a grammar compression scheme with good compression rates. Computing Re-Pair comes with the cost of maintaining large frequency tables, which makes it hard to compute Re-Pair on large-scale data sets. As a solution to this problem, we present, given a text of length $n$ whose characters are drawn from an integer alphabet of size $\sigma = n^{O(1)}$, an $O(\min(n^2,\; n^2 \lg\log_\tau n \,\lg\lg\lg n / \log_\tau n))$ time algorithm computing Re-Pair with $\max((n/c)\lg n,\; n\lg\tau) + O(\lg n)$ bits of working space including the text space, where $c \ge 1$ is a fixed user-defined constant and $\tau$ is the sum of $\sigma$ and the number of non-terminals. We give variants of our solution working in parallel or in the external memory model. Unfortunately, the algorithm does not seem practical, since a preliminary version already needs roughly one hour to compute Re-Pair on one megabyte of text.


“…To this end, one either has to improve the upper bound $z_b = O(bz \log\frac{n}{z})$ or has to provide a more elaborate series of examples improving the lower bound $z_b = \Omega(bz)$ from Section 3 (obviously, the examples must deal with non-phrase-aligned parsings). We point out, however, that the tightness of the bound from Theorem 3 would necessarily imply the tightness of the currently best upper bound $g = O(z \log\frac{n}{z})$ [4,18] from Lemma 3 that relates the size $g$ of the minimal grammar generating the string and the size $z$ of the LZ77 parsing for the string (the best lower bound up-to-date is $g = \Omega(z \frac{\log n}{\log\log n})$ [8,10]). Indeed, for a constant $b > 1$, if there exists a string whose LZ77 parsing has size $z$ and whose $b$-block contraction can have only LZ77 parsings of size at least $\Omega(z \log\frac{n}{z})$, then the minimal grammar of such string must have a size of at least $g = \Omega(z \log\frac{n}{z})$ since, by Lemma 3, the string has a phrase-aligned LZ77 parsing of size $g$, and thus, by Theorem 2, the $b$-block contraction has an LZ77 parsing of size $O(bg)$, which is $O(g)$ as $b$ is constant.…”

Section: Discussion (mentioning)

confidence: 96%

“…By a non-constructive argument [8,10,16], one can show that the converse equivalent reduction from LZ77 parsings to SLP grammars is not possible: in some cases, the size of the minimal SLP grammar can be $\Omega(\frac{\log n}{\log\log n})$ times larger than the size of the greedy (i.e., minimal) LZ77 parsing. For completeness, let us show this by repeating here the counting argument essentially used in [10,17].…”

Section: LZ77 Parsings (mentioning)

confidence: 99%

“…The relations between the LZ77 parsing and the grammar-based compression, which naturally induces an LZ77 parsing [4], are not sufficiently understood: known upper and lower bounds on their sizes differ by an $O(\log\log n)$ factor [8][9][10]. A better lower bound $z_b = \Omega(bz \log n)$, which would show that our main result is tight, even only for $b = 2$, would imply that the minimal grammar generating the string attaining this bound is of size $\Omega(z \log n)$, thus removing the $O(\log\log n)$-factor gap.…”

Section: Introduction (mentioning)

confidence: 99%


Lempel-Ziv Parsing for Sequences of Blocks

Kosolobov

Valenzuela

2021

Algorithms


The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other “elementary” letters)? In this paper, we prove that, for any integer $b > 1$, the number $z$ of phrases in the LZ77 parsing of a string of length $n$ and the number $z_b$ of phrases in the LZ77 parsing of the same string in which blocks of length $b$ are interpreted as separate letters (e.g., $b = 8$ in case of bytes) are related as $z_b = O(bz \log\frac{n}{z})$. The bound holds for both “overlapping” and “non-overlapping” versions of LZ77. Further, we establish a tight bound $z_b = O(bz)$ for the special case when each phrase in the LZ77 parsing of the string has a “phrase-aligned” earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsing produced, for instance, by grammar-based compression methods.
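The two quantities compared in the abstract above, z and z_b, can be computed naively for small inputs. The following is only a minimal sketch, not the paper's construction: it assumes the greedy "overlapping" LZ77 variant in which each phrase is either a single letter with no earlier occurrence or the longest prefix of the remaining suffix that also starts at an earlier position. The function names `lz77_phrase_count` and `to_blocks` are hypothetical, and the running time is far from the efficient algorithms used in practice.

```python
def lz77_phrase_count(seq):
    """Count phrases of a greedy, overlapping LZ77 parsing (naive sketch).

    Assumed definition: at position i, the next phrase is the longest
    prefix of seq[i:] that also starts at some earlier position j < i
    (the two occurrences may overlap); if no such prefix exists, the
    phrase is the single letter seq[i].
    """
    n = len(seq)
    i = 0
    phrases = 0
    while i < n:
        longest = 0
        for j in range(i):  # every earlier starting position
            length = 0
            while i + length < n and seq[j + length] == seq[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)  # a fresh letter forms a phrase of length 1
        phrases += 1
    return phrases


def to_blocks(seq, b):
    """Reinterpret seq as a sequence of letters that are blocks of length b."""
    if len(seq) % b != 0:
        raise ValueError("length of seq must be a multiple of b")
    return [tuple(seq[k:k + b]) for k in range(0, len(seq), b)]


if __name__ == "__main__":
    text = "abab" * 4
    z = lz77_phrase_count(text)
    z_b = lz77_phrase_count(to_blocks(text, 2))
    print("z =", z, "z_b =", z_b)
```

Running the sketch on a few short strings with different block lengths b gives a feel for how the two parsings relate; it says nothing about the asymptotic bounds themselves.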


“…The α-balanced grammar of Charikar et al. [9] produces a (non-SLP) grammar of size $O(g^* \log(N/g^*))$, where $g^*$ denotes the size of the smallest (non-SLP) grammar. Upper bounds and lower bounds for the approximation ratios of other practical grammar compressors including LZ78 [41], BISECTION [26], RePair [29], SEQUENTIAL [39], LONGEST MATCH [26], and GREEDY [1], are also known [9,2]. Charikar et al. [9] showed that the approximation ratio of RePair to the smallest (non-SLP) grammar is at most $O((N/\log N)^{2/3})$ and is at least $\Omega(\sqrt{\log N})$.…”

Section: Related Work (mentioning)

confidence: 99%

RePair Grammars are the Smallest Grammars for Fibonacci Words

Mieno, Inenaga, Horiyama

2022

Preprint

Self Cite


Grammar-based compression is a lossless data compression scheme that represents a given string $w$ by a context-free grammar that generates only $w$. While computing the smallest grammar which generates a given string $w$ is NP-hard in general, a number of polynomial-time grammar-based compressors which work well in practice have been proposed. RePair, proposed by Larsson and Moffat in 1999, is a grammar-based compressor which recursively replaces all possible occurrences of a most frequently occurring bigram in the string. Since there can be multiple choices of the most frequent bigram to replace, different implementations of RePair can result in different grammars. In this paper, we show that the smallest grammars generating the Fibonacci words $F_k$ can be completely characterized by RePair, where $F_k$ denotes the $k$-th Fibonacci word. Namely, all grammars for $F_k$ generated by any implementation of RePair are the smallest grammars for $F_k$, and no other grammars can be the smallest for $F_k$. To the best of our knowledge, Fibonacci words are the first non-trivial infinite family of strings for which RePair is optimal.
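The RePair procedure described above can be illustrated with a short, quadratic-time sketch. It is not the implementation studied in any of the cited papers. The assumptions are: ties between equally frequent bigrams are broken by first occurrence (the abstract notes that other choices are possible), occurrences are counted without overlap, replacement stops once no bigram occurs twice, and Fibonacci words follow the convention F_1 = b, F_2 = a, F_k = F_{k-1} F_{k-2}. All function names are hypothetical.

```python
from collections import Counter


def fibonacci_word(k):
    """Assumed convention: F_1 = 'b', F_2 = 'a', F_k = F_{k-1} + F_{k-2}."""
    prev, cur = "b", "a"
    if k == 1:
        return prev
    for _ in range(k - 2):
        prev, cur = cur, cur + prev
    return cur


def count_bigrams(seq):
    """Count non-overlapping occurrences of every bigram, left to right."""
    counts = Counter()
    last_start = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # Skip an occurrence overlapping the previously counted one (e.g. in 'aaa').
        if last_start.get(pair, -2) <= i - 2:
            counts[pair] += 1
            last_start[pair] = i
    return counts


def replace_pair(seq, pair, symbol):
    """Replace occurrences of pair by symbol, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Return (start sequence, rules); each rule maps a nonterminal to a bigram."""
    seq, rules = list(text), {}
    while True:
        counts = count_bigrams(seq)
        if not counts:
            break
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:  # no bigram repeats, so stop
            break
        symbol = f"X{len(rules) + 1}"
        rules[symbol] = pair
        seq = replace_pair(seq, pair, symbol)
    return seq, rules


def expand(seq, rules):
    """Derive the string generated by the grammar (iteratively, with a stack)."""
    out, stack = [], list(reversed(seq))
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.extend((right, left))
        else:
            out.append(s)
    return "".join(out)


def grammar_size(seq, rules):
    """Total number of symbols on all right-hand sides."""
    return len(seq) + 2 * len(rules)


if __name__ == "__main__":
    w = fibonacci_word(8)
    start, rules = repair(w)
    print(w, start, rules, grammar_size(start, rules))
```

Changing the tie-breaking rule inside `repair` (for example, preferring the last most frequent bigram) produces a different grammar for the same input, which is the source of the "different implementations" mentioned in the abstract.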
