Time-Optimal Top-$k$ Document Retrieval

Navarro, Gonzalo; Nekrich, Yakov

doi:10.1137/140998949

Cited by 21 publications

(26 citation statements)

References 72 publications

(101 reference statements)


“…By using perfect hashing to store the first characters of the edge labels descending from each node of v, we reach the locus in optimal time O(m) and the space is still O(n). If P comes packed using w/log σ symbols per computer word, we can descend in time O(⌈m log(σ)/w⌉) [91], which is optimal in the packed model. In the suffix array, all the suffixes starting with P form a range SA[sp..ep], which can be binary searched in time O(m log n), or O(m + log n) with additional structures [81].…”

Section: Suffix Trees and Arrays (mentioning)

confidence: 99%
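
The O(m log n) suffix-array search mentioned in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function names `build_suffix_array` and `sa_range` are hypothetical, the construction is deliberately naive, and the returned range is half-open `[sp, ep)` rather than the inclusive SA[sp..ep] used in the quote.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; adequate only for small examples.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_range(text, sa, pattern):
    """Return (sp, ep) such that sa[sp:ep] lists every suffix starting with pattern.

    Each of the O(log n) binary-search steps compares up to m symbols,
    which gives the O(m log n) bound.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo


text = "banana"
sa = build_suffix_array(text)
sp, ep = sa_range(text, sa, "ana")
occurrences = sorted(sa[sp:ep])  # text positions where the pattern starts
```

The number of occurrences is `ep - sp`, so counting needs only the two binary searches; locating reads the suffix-array cells in the range.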


Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Gagie; Navarro; Prezza. J. ACM, 2020.


Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitiveness (the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating (only) the text, the size of the smallest automaton recognizing the text factors) have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log_w(σ + n/r)) space, for a text of length n over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/log σ), we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in almost-optimal time O(log(n/r) + ℓ log(σ)/w).
Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1-2 orders of magnitude.
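
To make the measure r from the abstract concrete, the sketch below computes the BWT of a small text and counts its runs of equal symbols. It is only an illustration under stated assumptions: the helper names `bwt` and `count_runs` are hypothetical, the suffix sorting is naive, and the sentinel `$` is assumed to be absent from the text and smaller than every text symbol.

```python
def bwt(text, sentinel="$"):
    # Append a sentinel assumed to be smaller than all symbols of the text.
    s = text + sentinel
    # Naive suffix sorting; adequate only for small examples.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol for each sorted suffix is the symbol preceding it
    # (s[-1] is the sentinel, which precedes the suffix starting at 0).
    return "".join(s[i - 1] for i in sa)


def count_runs(s):
    # A new run starts at position 0 and wherever the symbol changes.
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


r_banana = count_runs(bwt("banana"))      # BWT is "annb$aa"
r_periodic = count_runs(bwt("abababab"))  # BWT is "bbbb$aaaa"
```

In this toy setting the periodic text has fewer BWT runs than the shorter, less regular one, which is the behaviour that makes r a useful measure of repetitiveness.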



“…We then replace our trie by a more sophisticated structure, which is described by Navarro and Nekrich [91, Sec. 2], built on the O(rs) distinct strings of length s. Let d = w/log σ.…”

Section: RAM-optimal Counting and Locating (mentioning)

confidence: 99%
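
The quantity d = w/log σ in the statement above is the number of pattern symbols that fit in one machine word. A rough sketch of that arithmetic follows; the function names are hypothetical, and two simplifying assumptions are made that the quote does not state: each symbol takes a whole number of bits, and symbols never straddle a word boundary.

```python
def symbols_per_word(sigma, w):
    # Bits needed per symbol for an alphabet of size sigma (at least 1 bit).
    bits = max(1, (sigma - 1).bit_length())
    # d: how many whole symbols fit in a w-bit word.
    return w // bits


def packed_words(m, sigma, w):
    # Words needed to hold a pattern of m symbols, d symbols per word.
    d = symbols_per_word(sigma, w)
    return -(-m // d)  # ceiling division


d_dna = symbols_per_word(4, 64)         # 2 bits per symbol
words_dna = packed_words(100, 4, 64)    # words for a 100-symbol DNA pattern
words_byte = packed_words(64, 256, 64)  # words for a 64-byte pattern
```

A search that handles one word per step therefore takes a number of steps proportional to `packed_words(m, sigma, w)` rather than to m, which is the sense of "optimal in the packed model" in the statements quoted here.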

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Gagie; Navarro; Prezza. J. ACM, 2020.


“…In the RAM model with word size Θ(log n), and if the consecutive symbols of P come packed into |P|/log_σ n words, the optimal time is instead O(|P|/log_σ n). This optimal time was recently reached by Navarro and Nekrich [31] (note that their time is not optimal if w = ω(log n)), with a simple application of weak-prefix search, already hinted in the original article [2]. However, even the randomized construction time of the weak-prefix search structure is O(n log^ε n), for any constant ε > 0.…”

Section: Compact Uncompressed (mentioning)

confidence: 97%

“…Compared with previous work, other indexes may be faster at counting, but either they are not built in linear deterministic time [5,19,31] or they are not compressed [31,7]. Our index outperforms all the previous compressed [13,1,6], as well as some uncompressed [15], indexes that can be built deterministically.…”

Section: Introduction (mentioning)

confidence: 93%

“…As an application of our tools, we also show that an index using O(n log σ) bits of space can be built in linear deterministic time, so that it can count in time O(|P|/log_σ n + log n (log log n)^2), which is RAM-optimal for w = Θ(log n) and sufficiently long patterns. Current indexes obtaining similar counting time require O(n log σ) construction time [19] or higher [31], or O(n log n) bits of space [31,7].…”

Section: Introduction (mentioning)

confidence: 99%


Fast Compressed Self-indexes with Deterministic Linear-Time Construction

2019


We introduce a compressed suffix array representation that, on a text T of length n over an alphabet of size σ, can be built in O(n) deterministic time, within O(n log σ) bits of working space, and counts the number of occurrences of any pattern P in T in time O(|P| + log log_w σ) on a RAM machine of w = Ω(log n)-bit words. This new index outperforms all the other compressed indexes that can be built in linear deterministic time, and some others. The only faster indexes can be built in linear time only in expectation, or require Θ(n log n) bits. We also show that, by using O(n log σ) bits, we can build in linear time an index that counts in time O(|P|/log_σ n + log n (log log n)^2), which is RAM-optimal for w = Θ(log n) and sufficiently long patterns.


Contextual Pattern Matching

Navarro. String Processing and Information Retrieval, 2020.

