Optimal-Time Text Indexing in BWT-runs Bounded Space

Gagie, Travis; Navarro, Gonzalo; Prezza, Nicola

doi:10.1137/1.9781611975031.96

Cited by 81 publications

(129 citation statements)

References 79 publications

Supporting

Mentioning

127

Contrasting

“…Currently, SA-scan uses n log n + n log σ bits of space for a text of length n on alphabet σ for the suffix array and text, respectively (the WT approach of Bader et al, uses slightly more). Because our methods consist (mostly) of simple scans of SA ranges or scans of the underlying text, they are easily translated to make use of recent results on Burrows-Wheeler-based compressed indexes [10] that allow fast access to elements of the suffix array from a compressed representation of it. Via this observation we derive the first compressed indexes for VLG matching.…”

Section: Discussion

mentioning

confidence: 99%

Fast Indexes for Gapped Pattern Matching

Cáceres

Puglisi

Zhukova

2020

SOFSEM 2020: Theory and Practice of Computer Science

We describe indexes for searching large data sets for variable-length-gapped (VLG) patterns. VLG patterns are composed of two or more subpatterns, between each adjacent pair of which is a gap-constraint specifying upper and lower bounds on the distance allowed between subpatterns. VLG patterns have numerous applications in computational biology (motif search), information retrieval (e.g., for language models, snippet generation, machine translation) and capture a useful subclass of the regular expressions commonly used in practice for searching source code. Our best approach provides search speeds several times faster than prior art across a broad range of patterns and texts.
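To make the VLG definition in the abstract above concrete, here is a minimal, naive sketch of VLG matching. It is not the index-based method of the cited paper; it simply scans the text for each subpattern and joins occurrences that satisfy the gap bounds. The names `find_all` and `vlg_match` are hypothetical, and the gap is assumed to count the characters between the end of one subpattern and the start of the next.

```python
from typing import List, Sequence, Tuple


def find_all(text: str, sub: str) -> List[int]:
    """Return every start position of `sub` in `text` (overlaps allowed)."""
    positions = []
    i = text.find(sub)
    while i != -1:
        positions.append(i)
        i = text.find(sub, i + 1)
    return positions


def vlg_match(
    text: str,
    subpatterns: Sequence[str],
    gaps: Sequence[Tuple[int, int]],
) -> List[Tuple[int, ...]]:
    """Naive VLG matching (illustrative only, not the paper's algorithm).

    gaps[k] = (lo, hi) is ASSUMED to bound the number of characters between
    the end of subpatterns[k] and the start of subpatterns[k + 1].
    Returns one tuple of start positions per match.
    """
    if len(gaps) != len(subpatterns) - 1:
        raise ValueError("need exactly one gap constraint per adjacent pair")

    occurrences = [find_all(text, p) for p in subpatterns]
    matches = [(start,) for start in occurrences[0]]

    for k, (lo, hi) in enumerate(gaps):
        extended = []
        for match in matches:
            prev_end = match[-1] + len(subpatterns[k])
            for start in occurrences[k + 1]:
                # Keep the pair only if the gap length is within bounds.
                if lo <= start - prev_end <= hi:
                    extended.append(match + (start,))
        matches = extended
    return matches


if __name__ == "__main__":
    # "ab", then 0 to 2 arbitrary characters, then "cd".
    print(vlg_match("abxxcdxabcd", ["ab", "cd"], [(0, 2)]))
```

This brute-force join costs time proportional to the product of the occurrence counts, which is the kind of overhead an index over the suffix array is meant to avoid.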

“…The Burrows-Wheeler Transform (BWT) [2] and FM-index [3] are central to the most popular short-read aligners, such as BWA [9] and Bowtie [8,7], but until recently it was not known how to apply these concepts effectively to whole genomic databases. Building on previous authors' work [11], Gagie, Navarro and Prezza [4] described how a fully functional variant of the FM-index for such a database could be stored in reasonable space: their variant takes O(r) machine words, where r is the number of runs in the BWT of the database, and thus is called the r-index. Prezza [14] gave a preliminary implementation, which was significantly extended by Boucher et al [1] and Kuhnle et al [6].…”

Section: Introduction

mentioning

confidence: 99%

Matching Reads to Many Genomes with the r-Index

Mun

Kuhnle

Boucher

et al. 2020

Journal of Computational Biology

Self Cite

The r-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This paper shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an r-index for that file; and how to query that index with ri-align. Availability: The source code for these programs is released under GPLv3 and available at https://github.com/alshai/r-index.
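The quantity r that gives the r-index its name is the number of equal-letter runs in the BWT of the text. The sketch below is a toy illustration of that quantity only; it is not the r-index and has nothing to do with the ri-buildfasta or ri-align programs. It builds the BWT by sorting suffixes directly, which is far too slow for genomic data. The function names `bwt` and `count_runs` and the `$` sentinel are assumptions made for this example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via naive suffix sorting (toy sizes only).

    ASSUMES `sentinel` does not occur in `text` and sorts before every
    character of `text` (true for "$" and ASCII letters).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Suffix array: start positions of suffixes in lexicographic order.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT[j] is the character preceding the j-th smallest suffix;
    # s[-1] (the sentinel) wraps around for the suffix starting at 0.
    return "".join(s[i - 1] for i in suffix_array)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in `s`."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    for text in ("banana", "ab" * 50):
        transformed = bwt(text)
        print(len(transformed), count_runs(transformed))
```

For the highly repetitive input `"ab" * 50`, the transformed string has 101 characters but only 3 runs, which illustrates why space bounded by r suits collections of near-identical genomes.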

“…With the TOPMed dataset, we achieve similar proportions by storing the identifiers at one out of 16,384 positions, making locate() 16 to 22 times slower. There is a theoretical proposal for supporting fast locate() queries in space proportional to the size of the run-length encoded BWT (Gagie et al, 2018). While there has been some progress in building the proposed index for large datasets (Kuhnle et al, 2019), scaling it up to TOPMed scale is still an open problem.…”

mentioning

confidence: 99%

Haplotype-aware graph indexes

Sirén

Garrison

Novak

et al. 2019

Preprint

Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are nonbiological, unlikely recombinations of true haplotypes. Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform (GBWT). We demonstrate the scalability of the new implementation by building a whole-genome index of the 5,008 haplotypes of the 1000 Genomes Project, and an index of all 108,070 TOPMed Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
