Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Gagie, Travis; Navarro, Gonzalo; Prezza, Nicola

doi:10.1145/3375890

Cited by 124 publications

(153 citation statements)

References 117 publications

(257 reference statements)

Supporting

Mentioning

142

Contrasting


“…as more assemblies from the Human Pan-Genome Reference Consortium [12] and similar projects emerge, SPUMONI is well positioned for sublinear index growth and a greater throughput advantage. For instance, the r-index underlying SPUMONI was previously shown to be able to index up to 10 human genomes with sublinear growth in the index size [13].…”

Section: Results (mentioning)

confidence: 99%


Pan-genomic Matching Statistics for Targeted Nanopore Sequencing

Ahmed

Rossi

Kovaka

et al. 2021

Preprint

Self Cite


Nanopore sequencing is an increasingly powerful tool for genomics. Recently, computational advances have allowed nanopores to sequence in a targeted fashion; as the sequencer emits data, software can analyze the data in real time and signal the sequencer to eject “non-target” DNA molecules. We present a novel method called SPUMONI, which enables rapid and accurate targeted sequencing with the help of efficient pangenome indexes. SPUMONI uses a compressed index to rapidly generate exact or approximate matching statistics (half-maximal exact matches) in a streaming fashion. When used to target a specific strain in a mock community, SPUMONI has accuracy similar to that of minimap2 when both are run against an index containing many strains per species. However, SPUMONI is 12 times faster than minimap2. SPUMONI’s index and peak memory footprint are also 15 and 4 times smaller than minimap2’s, respectively. These improvements become even more pronounced with larger reference databases; SPUMONI’s index size scales sublinearly with the number of reference genomes included. This could enable accurate targeted sequencing even when the targeted strains have not been sequenced or assembled previously. SPUMONI is open source software available from https://github.com/oma219/spumoni.
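To make the term "matching statistics" in the abstract above concrete, here is a minimal sketch under the usual textbook definition: for each position i of a query, the length of the longest prefix of the query suffix starting at i that occurs somewhere in the reference. The function name `matching_statistics` and the toy strings are illustrative assumptions. This naive version rescans the reference for every position; it is not SPUMONI's streaming algorithm and uses no compressed index.

```python
def matching_statistics(reference: str, query: str) -> list[int]:
    """Naive matching statistics (illustrative only, not SPUMONI's method).

    ms[i] is the length of the longest prefix of query[i:] that
    occurs as a substring of reference.
    """
    ms = []
    for i in range(len(query)):
        length = 0
        # Extend the match one character at a time while it still occurs.
        while i + length < len(query) and query[i:i + length + 1] in reference:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    # Long matches suggest the read resembles the reference; short ones do not.
    print(matching_statistics("GATTACA", "TTAGAT"))
```

Runs of large values indicate that the query shares long substrings with the reference, which is the kind of signal a targeted-sequencing classifier could threshold on.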



“…We use r to denote the number of maximal equal-letter runs of the BWT. The r-index [13] is a self-index which stores a run-length encoded BWT, i.e., each run is encoded as a character together with the run length.…”

Section: Methods (mentioning)

confidence: 99%
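The quoted description of the run-length encoded BWT can be illustrated with a small sketch. It assumes a `$` sentinel that sorts before every other character and builds the BWT by sorting all rotations, which is practical only for toy inputs. The function names are illustrative and are not taken from any r-index implementation.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (toy inputs only)."""
    s = text + sentinel  # assumption: sentinel is absent from text and sorts first
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Encode each maximal equal-letter run as (character, run length)."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


if __name__ == "__main__":
    transformed = bwt("banana")
    runs = run_length_encode(transformed)
    print(transformed)  # the BWT string
    print(runs)         # one (character, length) pair per run
    print(len(runs))    # r, the number of runs
```

On repetitive inputs the number of runs r tends to be much smaller than the text length, which is why storing one pair per run saves space.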

Pan-genomic Matching Statistics for Targeted Nanopore Sequencing

Ahmed

Rossi

Kovaka

et al. 2021

Preprint

Self Cite


“…The second component is the grammar G that generates DA [1..n], which must be binary and balanced. Such grammars can be built so as to ensure that their total size is O(r lg(n/r) lg n) bits [9], which is of the same order of the first component.…”

Section: Structure (mentioning)

confidence: 99%

“…Grammar is way out of the plot, however, because it requires 1.2-3.4 milliseconds to solve the queries, that is, 205-235 times slower than GCDA. The next smallest index is our variant Brute-C, which uses 0.35-0.55 bps and is generally smaller than GCDA, but slower by a factor of 2.6-6.7. Brute-L, occupying 0.38-0.60 bps, is also smaller in some cases, but much slower (180-1080 microseconds, out of the plot).…”

Section: Tuning Our Main Index (mentioning)

confidence: 99%

Fast, Small, and Simple Document Listing on Repetitive Text Collections

Cobas

Navarro

2019

String Processing and Information Retrieval

Self Cite


Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. In contrast, there are few analogous indexes for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length n that lists the ndoc distinct documents where a pattern of length m appears in time O(m + ndoc · lg n). We exploit the repetitiveness of the document array (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, and store them in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.

Muthukrishnan [20] designed the first linear-space and optimal-time index for general string collections. Given a collection of total length n, he builds an index of O(n) words that lists the ndoc documents where a pattern of length m appears in time O(m + ndoc). While linear space is deemed sufficiently small in classic scenarios, the solution is impractical for very large text collections unless one resorts to disk, which is orders of magnitude slower. Sadakane [26] showed how to reduce the space of Muthukrishnan's index to that of the statistically-compressed text plus O(n) bits, while raising the time complexity to only O(m + ndoc · lg n) if the appropriate underlying pattern-matching index is used [2].

The sharp growth of text collections is a concern in many recent applications, outpacing Moore's Law in some cases [27]. Fortunately, many of the fastest-growing text collections are highly repetitive: each document can be obtained from a few large blocks of other documents. These collections arise in different areas, such as repositories of genomes of the same species (which differ from each other by a small percentage only) like the 100K-genome project, software repositories that store all the versions of the code arranged in a tree or acyclic graph like GitHub, versioned document repositories where each document has a timeline of versions like Wikipedia, etc. On such text collections, statistical compression is ineffective [14] and even O(n) bits of extra space can be unaffordable.

Repetitiveness is the key to tackling the fast growth of these collections: their amount of new material grows much more slowly than their size. For example, version control systems compress those collections by storing the list of edits with respect to some reference document that is stored in plain form, and reconstruct each version by applying the edits to the reference version. Much more challenging, however, is to index those ...
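As a rough illustration of the document array described in the abstract above (the suffix array coarsened to document identifiers), the following sketch sorts every suffix of every document and records which document each sorted suffix came from. It stores the suffixes explicitly and uses no grammar compression, so it shows only the data structure's meaning, not the paper's index. The names `build_document_array` and `list_documents` are hypothetical.

```python
from bisect import bisect_left


def build_document_array(docs: list[str]) -> tuple[list[str], list[int]]:
    """Return (sorted suffixes, document array) for a small collection.

    doc_array[k] is the identifier of the document that owns the k-th
    suffix in lexicographic order. Explicit suffixes are for clarity only.
    """
    entries = sorted(
        (doc[i:], doc_id)
        for doc_id, doc in enumerate(docs)
        for i in range(len(doc))
    )
    suffixes = [suffix for suffix, _ in entries]
    doc_array = [doc_id for _, doc_id in entries]
    return suffixes, doc_array


def list_documents(suffixes: list[str], doc_array: list[int], pattern: str) -> list[int]:
    """List the distinct documents that contain pattern."""
    # Suffixes prefixed by pattern form one contiguous range starting here.
    lo = bisect_left(suffixes, pattern)
    hi = lo
    while hi < len(suffixes) and suffixes[hi].startswith(pattern):
        hi += 1
    # Collapse repeated identifiers in the range into a distinct, sorted list.
    return sorted(set(doc_array[lo:hi]))


if __name__ == "__main__":
    docs = ["banana", "bandana", "cabana"]
    suffixes, doc_array = build_document_array(docs)
    print(list_documents(suffixes, doc_array, "ana"))
    print(list_documents(suffixes, doc_array, "nd"))
```

The range scan here touches every occurrence of the pattern; the point of the indexes discussed above is to report each distinct document without paying for all occurrences.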


“…The Burrows-Wheeler Transform (BWT) of a text T is a suitable permutation of the letters of T, and it has become a fundamental tool for the design of self-indexing data structures. This permutation has been intensively studied from a theoretical and combinatorial viewpoints [24–30] and has found important and successful applications in several areas in science and engineering [7, 20, 31–40], but so far it was not yet used per se for direct detection of variants. The eBWT is an extension of the BWT to collections of strings that has been introduced in [41].…”

Section: Introduction (mentioning)

confidence: 99%

Variable-order reference-free variant discovery with the Burrows-Wheeler Transform

et al. 2020

Self Cite


Background: In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves the sensitivity and precision of previous de Bruijn graph-based tools by overcoming several of their limitations, namely: (i) the need to establish a fixed value, usually small, for the order k, (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read, and (iii) bad performance in repeated regions longer than k bases. The preliminary tool, however, could identify only SNPs, and it was too slow and memory-consuming due to the use of additional heavy data structures (namely, the Suffix and LCP arrays) besides the BWT.

Results: In this paper, we introduce a new algorithm and the corresponding tool ebwt2InDel that (i) extends the framework of [Prezza et al., AMB 2019] to also detect INDELs, and (ii) implements recent algorithmic findings that allow the whole analysis to be performed using just the BWT, thus reducing the working space by one order of magnitude and allowing the analysis of full genomes. Finally, we describe a simple strategy for effectively parallelizing our tool for SNP detection only. On a 24-core machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool ebwt2InDel is available at github.com/nicolaprezza/ebwt2InDel.

Conclusions: Results on a synthetic dataset covered at 30x (Human chromosome 1) show that our tool is indeed able to find up to 83% of the SNPs and 72% of the existing INDELs. These percentages considerably improve on the 71% of SNPs and 51% of INDELs found by the state-of-the-art tool based on de Bruijn graphs. We furthermore report results on larger (real) Human whole-genome sequencing experiments. In these cases as well, our tool exhibits a much higher sensitivity than the state-of-the-art tool.
