2020
DOI: 10.1145/3375890

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Abstract: Indexing highly repetitive texts (such as genomic databases, software repositories and versioned text collections) has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarith…

Cited by 124 publications (153 citation statements)
References 117 publications (257 reference statements)
“…as more assemblies from the Human Pan-Genome Reference Consortium [12] and similar projects emerge, SPUMONI is well positioned for sublinear index growth and a greater throughput advantage. For instance, the r-index underlying SPUMONI was previously shown to be able to index up to 10 human genomes with sublinear growth in the index size [13].…”
Section: Results (mentioning)
confidence: 99%
“…We use r to denote the number of maximal equal-letter runs of the BWT. The r-index [13] is a self-index which stores a run-length encoded BWT, i.e., each run is encoded as a character together with the run length.…”
Section: Methods (mentioning)
confidence: 99%
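
The run-length encoding described in the statement above can be illustrated with a minimal sketch. It assumes a naive rotation-sorting BWT with a `$` sentinel, which is only practical for short strings; the function names `bwt`, `run_length_encode` and `count_runs` are hypothetical and are not taken from the r-index implementation.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive BWT via sorted rotations (illustrative only, quadratic space)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # Sort all cyclic rotations and take the last character of each one.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Encode each maximal equal-letter run as a (character, length) pair."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def count_runs(text):
    """Return r, the number of maximal equal-letter runs in the BWT of text."""
    return len(run_length_encode(bwt(text)))


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                     # annb$aa
    print(run_length_encode(transformed))  # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
    print(count_runs("banana"))            # 5
```

On repetitive inputs the number of pairs produced by `run_length_encode` tends to grow much more slowly than the text length, which is the intuition behind using r as a compressibility measure.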
“…The second component is the grammar G that generates DA[1..n], which must be binary and balanced. Such grammars can be built so as to ensure that their total size is O(r lg(n/r) lg n) bits [9], which is of the same order of the first component.…”
Section: Structure (mentioning)
confidence: 99%
“…Grammar is way out of the plot, however, because it requires 1.2-3.4 milliseconds to solve the queries, that is, 205-235 times slower than GCDA. The next smallest index is our variant Brute-C, which uses 0.35-0.55 bps and is generally smaller than GCDA, but slower by a factor of 2.6-6.7. Brute-L, occupying 0.38-0.60 bps, is also smaller in some cases, but much slower (180-1080 microseconds, out of the plot).…”
Section: Tuning Our Main Index (mentioning)
confidence: 99%
“…The Burrows-Wheeler Transform (BWT) of a text T is a suitable permutation of the letters of T, and it has become a fundamental tool for the design of self-indexing data structures. This permutation has been intensively studied from a theoretical and combinatorial viewpoints [24–30] and has found important and successful applications in several areas in science and engineering [7, 20, 31–40], but so far it was not yet used per se for direct detection of variants. The eBWT is an extension of the BWT to collections of strings that has been introduced in [41].…”
Section: Introduction (mentioning)
confidence: 99%