A grammar compressor for collections of reads with applications to the construction of the BWT

Díaz-Domínguez, Diego; Navarro, Gonzalo

doi:10.1109/dcc50243.2021.00016

Cited by 10 publications

(8 citation statements)

References 24 publications


“…There, we measure the time for count(P) with |P| = 2^x for each x ∈ [8..15]. For each data point and each dataset T, we extract 2^12 random samples of equal length from T, perform the query for each sample, and measure the average time per character. From Fig.…”

Section: Methods (mentioning)

confidence: 99%
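
The measurement protocol quoted above can be illustrated with a minimal sketch. It assumes a plain Python string as the dataset and a naive scanning `naive_count` as a stand-in for the index query; both function names and the `seed` parameter are hypothetical and are not taken from the cited paper.

```python
import random
import time


def naive_count(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def avg_time_per_char(text, length, num_samples, count_fn=naive_count, seed=0):
    """Average query time per pattern character over random samples of text."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Each pattern is a substring of the text, so it occurs at least once.
        start = rng.randrange(len(text) - length + 1)
        pattern = text[start:start + length]
        t0 = time.perf_counter()
        count_fn(text, pattern)
        total += time.perf_counter() - t0
    return total / (num_samples * length)


if __name__ == "__main__":
    toy_text = "abracadabra" * 1000
    # The quoted setup uses lengths 2^8..2^15 and 2^12 samples; smaller values keep this toy run fast.
    for x in range(3, 8):
        print(2 ** x, avg_time_per_char(toy_text, 2 ** x, num_samples=64))
```

Sampling patterns from the text itself guarantees that every query has at least one occurrence, which is why the quoted setup can report a per-character average without handling empty results.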

“…They applied a grammar compression merging frequent bigrams similar to Re-Pair [28], and empirically could improve the computation of the BWT as well as the reconstruction of the text from the BWT. With a similar target, Díaz-Domínguez and Navarro [12,13] computed the extended BWT [31], a BWT variant for multiple texts, from the GCIS grammar.…”

Section: Related Work (mentioning)

confidence: 99%


FM-Indexing Grammars Induced by Suffix Sorting for Long Patterns

Deng, Hon, Köppl et al. 2021

Preprint


The run-length compressed Burrows-Wheeler transform (RLBWT) used in conjunction with the backward search introduced in the FM index is the centerpiece of most compressed indexes working on highly-repetitive data sets like biological sequences. Compared to grammar indexes, the size of the RLBWT is often much bigger, but queries like counting the occurrences of long patterns can be done much faster than on any existing grammar index so far. In this paper, we combine the virtues of a grammar with the RLBWT by building the RLBWT on top of a special grammar based on induced suffix sorting. Our experiments reveal that our hybrid approach outperforms the classic RLBWT with respect to the index sizes, and with respect to query times on biological data sets for sufficiently long patterns.
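
As a rough illustration of the two ingredients named in this abstract, the sketch below builds a BWT, counts its runs (the quantity r that governs the RLBWT size), and counts pattern occurrences with FM-index backward search. It is a toy under stated assumptions: the BWT comes from naively sorting rotations, rank queries are linear scans, and the function names are hypothetical. It does not reproduce the grammar-based construction of the paper.

```python
def bwt(text: str) -> str:
    """BWT of text with a "\0" sentinel appended, via naive rotation sorting."""
    t = text + "\0"  # sentinel, assumed absent from text and smaller than every character
    n = len(t)
    order = sorted(range(n), key=lambda i: t[i:] + t[:i])
    # The last character of the rotation starting at i is t[i - 1].
    return "".join(t[i - 1] for i in order)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


def backward_count(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern using backward search over the BWT."""
    # first[c] = number of characters in the text strictly smaller than c.
    first = {}
    for i, c in enumerate(sorted(bwt_str)):
        first.setdefault(c, i)

    def rank(c: str, i: int) -> int:
        # Occurrences of c in bwt_str[:i]; a real index answers this in near-constant time.
        return bwt_str.count(c, 0, i)

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + rank(c, lo)
        hi = first[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    b = bwt("banana")
    print(repr(b), count_runs(b), backward_count(b, "ana"))
```

Each backward-search step costs two rank queries regardless of text length, which is the property that makes counting long patterns fast on BWT-based indexes.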


“…The experiments showed that the r-index outperforms all the other implemented indexes by orders of magnitude in space or in time to locate pattern occurrences on highly repetitive datasets. However, other experiments on more typical repetitiveness scenarios [23,5,6,1] showed that the space of the r-index degrades very quickly as repetitiveness decreases. For example, a grammar-based index (which can be of size g = O(z log(n/z))) is usually slower but significantly smaller [5], and an even slower Lempel-Ziv based index of size O(z) [15] is even smaller.…”

Section: Introduction (mentioning)

confidence: 94%

“…However, r degrades faster than z as repetitiveness drops: in an experiment on bacterial genomes in the same article, where n/r ≈ 100, the r-index space approaches 0.9 bps, 4 times that of the lz-index; r also approaches 4z. Experiments on other datasets show that the r-index tends to be considerably larger [23,5,6,1]. Indeed, in some realistic cases n/r can be over 1,500, but in most cases it is well below: 40-160 on versioned software and document collections and fully assembled human chromosomes, 7.5-50 on virus and bacterial genomes (with r in the range 4z-7z), and 4-9 on sequencing reads; see Section 5.…”

Section: Introduction (mentioning)

confidence: 99%

A Fast and Small Subsampled R-index

Cobas, Gagie, Navarro 2021

Preprint

Self Cite


The r-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, O(r) where r is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the sr-index, a variant that limits the space to O(min(r, n/s)) for a text of length n and a given parameter s, at the expense of multiplying by s the time per occurrence reported. The sr-index is obtained by carefully subsampling the text positions indexed by the r-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the sr-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the r-index while using 1.5-3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the sr-index, using about half the space, but they are an order of magnitude slower.


“…Nunes et al. [33] showed how to compute the suffix array and the longest-common-prefix array from GCIS during a decompression step restoring the original text. Recently, Díaz-Domínguez and Navarro [10] show how to compute the BWT directly from the GCIS grammar.…”

Section: Related Work (mentioning)

confidence: 99%

Grammar Index By Induced Suffix Sorting

Akagi, Köppl, Nakashima et al. 2021

Preprint


Pattern matching is the most central task for text indices. Most recent indices leverage compression techniques to make pattern matching feasible for massive but highly-compressible datasets. Within this kind of indices, we propose a new compressed text index built upon a grammar compression based on induced suffix sorting [Nunes et al., DCC'18]. We show that this grammar exhibits a locality sensitive parsing property, which allows us to specify, given a pattern P, certain substrings of P, called cores, that are similarly parsed in the text grammar whenever these occurrences are extensible to occurrences of P. Supported by the cores, given a pattern of length m, we can locate all its occ occurrences in a text T of length n within O(m lg |S| + occC lg |S| lg n + occ) time, where S is the set of all characters and nonterminals, occ is the number of occurrences, and occC is the number of occurrences of a chosen core C of P in the right hand side of all production rules of the grammar of T. Our grammar index requires O(g) words of space and can be built in O(n) time using O(g) working space, where g is the sum of the right hand sides of all production rules. We underline the strength of our grammar index with an exhaustive practical evaluation that gives evidence that our proposed solution excels at locating long patterns in highly-repetitive texts. Our implementation is available at https://github.com/TooruAkagi/GCIS_Index.
