Towards a Definitive Measure of Repetitiveness

Kociumaka, Tomasz; Navarro, Gonzalo; Prezza, Nicola

doi:10.1007/978-3-030-61792-9_17

Cited by 40 publications

(77 citation statements)

References 37 publications


“…Examples of those measures include (but are not limited to) the number z of factors in the LZ77 factorization [21], the number g of rules in the smallest context-free grammar generating the word [17], the size b of the smallest bidirectional macro scheme [26], and the size e of the CDAWG [4]. More recently, it was shown that all those compressors are particular cases of a combinatorial object named string attractor [16] whose size γ lower-bounds all measures r, z, g, b, and e. In turn, in [19] it was shown that γ is lower-bounded by another measure, δ, which is linked to factor complexity (that is, to the number of distinct factors of each length) and better captures the word's repetitiveness. On the upper-bound side, the papers [16,19] provided approximation ratios of all measures but r with respect to γ.…”

Section: Introduction (mentioning)

confidence: 99%

“…More recently, it was shown that all those compressors are particular cases of a combinatorial object named string attractor [16] whose size γ lower-bounds all measures r, z, g, b, and e. In turn, in [19] it was shown that γ is lower-bounded by another measure, δ, which is linked to factor complexity (that is, to the number of distinct factors of each length) and better captures the word's repetitiveness. On the upper-bound side, the papers [16,19] provided approximation ratios of all measures but r with respect to γ. Finding an upper-bound for r remained an open problem until the recent work of Kempa and Kociumaka [15], who showed that, for any word of length n, r = O(δ log² n) (which in turn implies r = O(γ log² n)).…”

Section: Introduction (mentioning)

confidence: 99%
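
The statements above describe δ only as "linked to factor complexity". In the cited paper, δ is defined as the maximum over k of d_k / k, where d_k is the number of distinct length-k factors of the word. A minimal brute-force sketch of that definition follows; the function name `delta` is hypothetical, and the quadratic-space enumeration is for illustration on short strings only, not the linear-time computation from the literature.

```python
def delta(s: str) -> float:
    """Brute-force delta(s) = max over k of d_k / k.

    d_k is the number of distinct substrings (factors) of length k.
    Illustrative only: enumerates every factor, so it is far too slow
    for real repetitive collections.
    """
    n = len(s)
    best = 0.0
    for k in range(1, n + 1):
        # Collect the distinct factors of length k.
        distinct = {s[i:i + k] for i in range(n - k + 1)}
        best = max(best, len(distinct) / k)
    return best
```

On a unary word such as "aaaa" every d_k equals 1, so the maximum is reached at k = 1; on "abab" the two distinct letters already give the maximum.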


Novel Results on the Number of Runs of the Burrows-Wheeler-Transform

Giuliani

Inenaga

Lipták

et al. 2021

SOFSEM 2021: Theory and Practice of Computer Science

Self Cite


The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is one of the fundamental components of many current data structures in string processing. It is central in data compression, as well as in efficient query algorithms for sequence data, such as webpages, genomic and other biological sequences, or indeed any textual data. The BWT lends itself well to compression because its number of equal-letter runs (usually referred to as r) is often considerably lower than that of the original string; in particular, it is well suited for strings with many repeated factors. In fact, much attention has been paid to the r parameter as measure of repetitiveness, especially to evaluate the performance in terms of both space and time of compressed indexing data structures. In this paper, we investigate ρ(v), the ratio of r and of the number of runs of the BWT of the reverse of v. Kempa and Kociumaka [FOCS 2020] gave the first non-trivial upper bound as ρ(v) = O(log² n), for any string v of length n. However, nothing is known about the tightness of this upper bound. We present infinite families of binary strings for which ρ(v) = Θ(log n) holds, thus giving the first non-trivial lower bound on ρ(n), the maximum over all strings of length n. Our results suggest that r is not an ideal measure of the repetitiveness of the string, since the number of repeated factors is invariant between the string and its reverse. We believe that there is a more intricate relationship between the number of runs of the BWT and the string's combinatorial properties.
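
To make the quantities r and ρ(v) from the abstract concrete, here is a small sketch that builds the BWT by sorting all cyclic rotations, counts equal-letter runs, and takes the ratio between a string and its reverse. The helper names (`bwt`, `runs`, `rho`) are hypothetical, and the choice to use cyclic rotations without an end-of-string marker is an assumption; the paper's exact convention may differ.

```python
def bwt(s: str) -> str:
    """BWT via sorted cyclic rotations (no end marker; an assumption)."""
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def runs(s: str) -> int:
    """Number of maximal equal-letter runs in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def rho(v: str) -> float:
    """Ratio of BWT runs of v to BWT runs of reversed v (v non-empty)."""
    return runs(bwt(v)) / runs(bwt(v[::-1]))
```

Sorting rotations takes quadratic space and is only suitable for short examples; practical tools build the BWT from a suffix array instead.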


“…For example, there are indexes based on LZ77 [37], RLBWT [17], and grammar-based compression [11]. Although recent studies [33,36,45] have investigated the fundamentals of these techniques and obtained a unified view of the compressibility of highly repetitive data, each compressed format still has pros and cons that cannot be ignored in practice. LZ77 usually achieves better compression than other compression methods, the index based on RLBWT (called r-index) supports very fast pattern search, and grammar-based compression is easy to handle in both theory and practice.…”

Section: Restructuring Compressed Data (mentioning)

confidence: 99%

Information Processing on Compressed Data

Takabatake

Tomohiro

Sakamoto

2021

Sublinear Computation Paradigm


We survey our recent work related to information processing on compressed strings. Note that a “string” here contains any fixed-length sequence of symbols and therefore includes not only ordinary text but also a wide range of data, such as pixel sequences and time-series data. Over the past two decades, a variety of algorithms and their applications have been proposed for compressed information processing. In this survey, we mainly focus on two problems: recompression and privacy-preserving computation over compressed strings. Recompression is a framework in which algorithms transform a given compressed data into another compressed format without decompression. Recent studies have shown that a higher compression ratio can be achieved at lower cost by using an appropriate recompression algorithm such as preprocessing. Furthermore, various privacy-preserving computation models have been proposed for information retrieval, similarity computation, and pattern mining.


“…Both new measures better capture the compressibility of repetitive strings. It has been proved that δ ≤ γ ≤ z = O(δ lg(n/δ)) [7,8]. In this paper, we design the first string attractor based indexes (which is also workable upon LZ-parsing) to support computation of the matching statistics with space cost measured by γ and δ.…”

Section: Introduction (mentioning)

confidence: 99%

“…To access the text T [1..n] within compressed space, we apply the string indexing data structure by Kociumaka et al. [8] with space cost measured by δ. We give a simple and practical algorithm that reduces the problem of computing MS into O(m²) times of 2D orthogonal range predecessor queries upon γ points on the grid.…”

Section: Introduction (mentioning)

confidence: 99%

Computing Matching Statistics on Repetitive Texts

Gao

2021

Preprint


Computing the matching statistics of a string P [1..m] with respect to a text T [1..n] is a fundamental problem which has application to genome sequence comparison. In this paper, we study the problem of computing the matching statistics upon highly repetitive texts. We design three different data structures that are similar to LZ-compressed indexes. The space costs of all of them can be measured by γ, the size of the smallest string attractor [STOC'2018] and δ, a better measure of repetitiveness [LATIN'2020].
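
For readers unfamiliar with the problem, the following naive sketch shows what is being computed: for each position i of P, the length of the longest prefix of P[i..] that occurs somewhere in T. This is a plain baseline working on the uncompressed text, not one of the paper's compressed data structures; the function name `matching_statistics` and the 0-based indexing are assumptions made for illustration.

```python
def matching_statistics(p: str, t: str) -> list[int]:
    """Naive matching statistics of pattern p against text t.

    ms[i] is the length of the longest prefix of p[i:] that occurs
    as a substring of t. Uses repeated substring search, so it is
    only meant for small inputs.
    """
    ms = []
    for i in range(len(p)):
        length = 0
        # Extend the match one character at a time while it still occurs in t.
        while i + length < len(p) and p[i:i + length + 1] in t:
            length += 1
        ms.append(length)
    return ms
```

For example, with P = "abc" and T = "xabx", the prefix "ab" of P occurs in T but "abc" does not, so the first entry is 2.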
