Universal indexes for highly repetitive document collections

Claude, Francisco; Fariña, Antonio; Martínez‐Prieto, Miguel A.; Navarro, Gonzalo

doi:10.1016/j.is.2016.04.002

Cited by 30 publications

(26 citation statements)

References 64 publications

“…We compared r-index with the state-of-the-art index for each compressibility measure: lzi [73,24] (z), slp [25,24] (g), rlcsa [79,80] (r), and cdawg [9] (e). We also included hyb [30,31], which combines a Lempel-Ziv index with an FM-index, with parameter M = 8, which is optimal for our experiment.…”

Section: Methods (mentioning)

confidence: 99%

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Gagie

Navarro

Prezza

2020

J. ACM

Self Cite

118

142


Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitiveness (the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating (only) the text, the size of the smallest automaton recognizing the text factors) have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log_w(σ + n/r)) space, for a text of length n over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/log σ), we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1-2 orders of magnitude.
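The measure r used in the abstract above can be made concrete with a small sketch. This is an illustration only, not the authors' implementation: it builds the BWT naively by sorting all rotations (fine for toy inputs, far too slow for real collections), it assumes a `$` sentinel that does not occur in the text, and the function names `bwt` and `count_runs` are hypothetical.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (naive, for illustration)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Sort every rotation of s and keep the last character of each.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive characters in s."""
    if not s:
        return 0
    # A new run starts at every position whose character differs from the previous one.
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))

    # Assumed toy "repetitive collection": many copies of one short string.
    repetitive = "GATTACA" * 50
    print(len(repetitive), count_runs(bwt(repetitive)))
```

On `banana` the transform is `annb$aa`, which has r = 5 runs. On the repeated toy text, r stays far below the text length n, which is the property that O(r)-space indexes exploit.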


“…Numerous variants of LZ77 have been developed and several widely used implementations are available (such as gzip [1]). Recently, LZ77 has been shown to be particularly effective at handling highly-repetitive data sets [31,33,28,10,4] and LZ77 compression is always at least as powerful as any grammar representation [38,9].…”

Section: Introduction (mentioning)

confidence: 99%

Time–space trade-offs for Lempel–Ziv compressed indexing

Bille

Ettienne

Gørtz

et al. 2018

Theoretical Computer Science


Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. We obtain the following time-space trade-offs: For constant-sized alphabets […] where n and m are the length of the input string and query string respectively, z is the number of phrases in the LZ77 parse of the input string, occ is the number of occurrences of the query in the input and ε > 0 is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from O(m lg m) to O(m) at the cost of increasing the space by a factor lg lg z. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1 + lg^ε z / lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n^(1−δ)) for constant δ > 0, this becomes O(m). Our index also supports extraction of any substring of length ℓ in O(ℓ + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.
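The parameter z in the abstract above, the number of phrases in the LZ77 parse, can be illustrated with a short sketch. It is an assumption-laden toy rather than the construction used in the paper: it uses the greedy variant in which each phrase is either the longest prefix of the remaining text that also starts at an earlier position (overlap with the phrase itself is allowed) or a single new character, it searches naively with `str.find`, and the name `lz77_phrases` is hypothetical.

```python
from typing import List


def lz77_phrases(text: str) -> List[str]:
    """Greedy LZ77-style parse (naive search; for illustration only)."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        # Extend the phrase while text[i:i+length+1] also occurs at a start
        # position < i. Limiting the search window to text[0:i+length] forces
        # the earlier occurrence to start before i while still allowing overlap.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        if length == 0:
            length = 1  # first occurrence of this character: literal phrase
        phrases.append(text[i:i + length])
        i += length
    return phrases


if __name__ == "__main__":
    parse = lz77_phrases("abababab")
    print(parse, "z =", len(parse))
```

For `abababab` the parse is `a`, `b`, `ababab`, so z = 3 even though n = 8; on highly repetitive inputs z grows much more slowly than n, which is why space bounds such as O(z lg(n/z)) are attractive.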


“…All serious short read aligners (see e.g., [19,20,6]) implement a kind of compressed suffix array [23] called the FM-index [5], usually in n log σ bits of space (2n bits on DNA data), or even near nH_k bits, where H_k is the kth-order empirical entropy of the underlying data. However, as genome sequencing becomes cheaper, and the number of genome sequences in genomic databases grows, indexes of linear size (even those taking around nH_k bits) quickly become too large.…”

Section: Introduction (mentioning)

confidence: 99%

Hybrid Indexing Revisited

Ferrada

Kempa

Puglisi

2018

2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX)


Hybrid indexing is a recent approach to text indexing that allows the space-usage of conventional text indexes (e.g., suffix trees, suffix arrays, FM-indexes) to scale well with the text size, n, when z, the size of the Lempel-Ziv parsing of the text, is small relative to n. The price for this improved scalability is that an upper bound M on the pattern length that can be searched for must be declared at index construction time. Because the size of the resulting index contains an O(Mz) term, M must be kept reasonably small, though it has been shown that M ≈ 100 leads to acceptable performance in some genomic applications. However, despite its promise, the practical performance of hybrid indexing relative to other compressed index data structures is poorly understood. This paper addresses that need, detailing experiments that show hybrid indexing, when carefully implemented, to be significantly smaller and faster than alternative approaches on a broad range of data of different levels of compressibility. We also describe practical extensions to hybrid indexing that obviate the restriction on M, supporting search for patterns of arbitrary length.
