CHICO: A Compressed Hybrid Index for Repetitive Collections

Valenzuela, Daniel

doi:10.1007/978-3-319-38851-9_22

“…We test our implementations on five DNA datasets from the Pizza&Chili repetitive corpus 8 , which include the whole genomes of approximately 36 strains of the same eukaryotic species, a collection of 23 and approximately 78 thousand substrings of the genome of the same bacterium, and an artificially repetitive with the same sampling rates, to the five variants in the implementation of the LZ77 index described in [14], and to a recent implementation of the compressed hybrid index [22]. The FM index uses RRR bitvectors in its wavelet tree.…”

Section: Results (mentioning)

confidence: 99%

Flexible Indexing of Repetitive Collections

Belazzougui, Cunial, Gagie et al. 2017

Unveiling Dynamics and Complexity


Abstract. Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries between two and four orders of magnitude faster than all LZ77 and hybrid index implementations, at the cost of slower locate queries. Combining the RLBWT with the compact directed acyclic word graph answers locate queries for short patterns between four and ten times faster than a version of the run-length compressed suffix array (RLCSA) that uses comparable memory, and with very short patterns our index achieves speedups even greater than ten with respect to RLCSA.
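As a rough illustration of why the run-length encoded BWT suits repetitive text, the sketch below builds the Burrows-Wheeler transform by sorting rotations and then collapses it into runs. It is a toy under stated assumptions, not the construction used in the paper: the function names `bwt` and `run_length_encode` and the `$` sentinel are chosen here for illustration, and the quadratic rotation sort is only practical for short strings.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    # Assumption: the sentinel does not occur in the text and sorts before
    # every other character ('$' does for plain letters).
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s: str) -> list:
    """Collapse a string into (character, run length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


if __name__ == "__main__":
    repetitive = "GATTACA" * 20
    runs = run_length_encode(bwt(repetitive))
    # The number of runs stays far below the text length on repetitive input.
    print(len(repetitive), len(runs))
```

The number of runs, rather than the text length, is what drives the space of RLBWT-based indexes, which is why they do well on collections of near-identical genomes.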


“…The first non-trivial implementation detail is that in our implementation we employ the idea described in [28] to reduce the number of LZ phrases. Namely, for any maximal sequence of adjacent phrases where each phrase has length ≤ M, we merge them into one superphrase.…”

Section: Implementation Details (mentioning)

confidence: 99%
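The phrase-merging step quoted above can be sketched as follows. This is a minimal sketch, assuming the parse is given as a list of phrase lengths in text order; the function name `merge_short_phrases` and the (start, length) output format are hypothetical and are not taken from either cited implementation.

```python
def merge_short_phrases(phrase_lengths, M):
    """Merge each maximal run of adjacent phrases of length <= M.

    Returns (start, length) pairs that still cover the whole text.
    Phrases longer than M are kept unchanged.
    """
    merged = []
    pos = 0            # start position of the current phrase in the text
    run_start = None   # start of the run of short phrases being collected
    run_len = 0
    for length in phrase_lengths:
        if length <= M:
            if run_start is None:
                run_start, run_len = pos, 0
            run_len += length
        else:
            if run_start is not None:
                # A long phrase ends the run: emit it as one superphrase.
                merged.append((run_start, run_len))
                run_start = None
            merged.append((pos, length))
        pos += length
    if run_start is not None:
        merged.append((run_start, run_len))
    return merged


if __name__ == "__main__":
    # Eight phrases, three maximal runs of short ones when M = 3.
    print(merge_short_phrases([1, 1, 2, 10, 3, 1, 12, 2], M=3))
```

Note that a superphrase built this way can itself be longer than M; the quoted passage only bounds the phrases being merged, and how the cited index handles the result is not stated here.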

2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX)

Pagh, Venkatasubramanian 2018


Hybrid indexing is a recent approach to text indexing that allows the space usage of conventional text indexes (e.g., suffix trees, suffix arrays, FM-indexes) to scale well with the text size, n, when z, the size of the Lempel-Ziv parsing of the text, is small relative to n. The price for this improved scalability is that an upper bound M on the pattern length that can be searched for must be declared at index construction time. Because the size of the resulting index contains an O(Mz) term, M must be kept reasonably small, though it has been shown that M ≈ 100 leads to acceptable performance in some genomic applications. However, despite its promise, the practical performance of hybrid indexing relative to other compressed index data structures is poorly understood. This paper addresses that need, detailing experiments that show hybrid indexing, when carefully implemented, to be significantly smaller and faster than alternative approaches on a broad range of data of different levels of compressibility. We also describe practical extensions to hybrid indexing that obviate the restriction on M, supporting search for patterns of arbitrary length.
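To make the roles of n, z and M concrete, the sketch below counts the phrases of a naive greedy Lempel-Ziv parse and compares the product M * z with n. Several things here are assumptions made for illustration: the function name `lz77_phrase_count`, the particular parse variant (longest earlier match with overlaps allowed, or a single new character), and the use of M * z as a stand-in for the O(Mz) term, which ignores constants and the other terms of the index size.

```python
def lz77_phrase_count(text: str) -> int:
    """Count the phrases z of a greedy LZ77-style parse (naive, small inputs only).

    Each phrase is the longest prefix of the remaining text that also
    occurs starting at an earlier position (overlaps allowed), or a
    single character if no such occurrence exists.
    """
    n = len(text)
    i = 0
    z = 0
    while i < n:
        longest = 0
        for j in range(i):
            k = 0
            # j < i, so j + k stays in range whenever i + k does.
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)
        z += 1
    return z


if __name__ == "__main__":
    text = "GATTACA" * 200
    M = 100  # pattern-length bound; the abstract mentions M of about 100
    z = lz77_phrase_count(text)
    # On highly repetitive text, M * z can be well below n.
    print(len(text), z, M * z)
```

On text with little repetition, z grows close to n and the product M * z exceeds the text length, which is the regime where the abstract's caveat about keeping M small matters most.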


“…Valenzuela [28] has since demonstrated hybrid indexing to be very effective in practice for indexing massive genomic data sets (in the terabyte range), and the technique now underlies tools for detecting genomic variants in pangenomic data [29]. However, Valenzuela's index is tightly coupled to the DNA alphabet and still carries the restriction that the maximum searchable pattern length is M, meaning it cannot be applied to long, so-called third generation DNA sequence reads (see, e.g.…”

Section: Introduction (mentioning)

confidence: 99%

Hybrid Indexing Revisited

Ferrada, Kempa, Puglisi 2018

2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX)



Cited by 17 publications

References 29 publications
