Compact Indexes for Flexible Top-k Retrieval

Gog, Simon; Petri, Matthias

doi:10.1007/978-3-319-19929-0_18

Cited by 4 publications

(5 citation statements)

References 25 publications


“…All programs were compiled with optimizations using g++ version 5.2.0. We are using test collections from the natural language domain, as character and as word sequence: two Wikipedia dumps of different size, a subset of publicly available Reddit comments 4 and all revisions of 100 Finnish Wikipedia articles (each revision is a single document). Additionally we use a word parsing of the TREC gov2 collection [7]. Table 1 tion and benchmarks are publicly available 5 and contain all parameters left out here due to space constrains.…”

Section: Methods (mentioning)

confidence: 99%

Elias-Fano meets Single-Term Top-k Document Retrieval

Labeit

Gog

2017

2017 Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX)

Self Cite

A fundamental problem in Information Retrieval is to determine the k most relevant documents of a collection for a given query word or phrase P. In a recent result, Navarro and Nekrich [SODA 2012] showed that this problem can be solved in optimal time complexity of O(|P| + k) with a precomputed linear-space index. The size of this optimal-time index was estimated to be 80 times the collection size, rendering it impractical. In subsequent work, Navarro and Konow [DCC 2013] and Gog and Navarro [ALENEX 2015] created a practical version with slightly worse query time guarantees but reduced the space to 2.5-3 times the collection size. The index is conceptually simple and is divided into five components. In this paper we show how the n log N bits required by the usually largest component, the so-called repetition array, can be reduced to n log log n + O(n), where n is the size of the collection and N the number of documents. As the overall query time complexity matches that of the old index, we achieve a theoretically superior time-space trade-off. We explore the practical properties of the improved index in a detailed experimental study and compare to the previously established baseline. Index sizes are now between 1.5-2 times the collection size while query speed is comparable to the larger indexes. We also show that the new approach automatically adapts to highly repetitive text collections, which are for instance produced by version control systems.
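To get a feel for the two space bounds quoted in this abstract, the following sketch evaluates n log N against n log log n + O(n) for one hypothetical collection. It assumes base-2 logarithms and stands in an arbitrary constant `c` for the hidden O(n) term; the function name, the constant, and the example sizes are illustrative assumptions, not figures from the paper.

```python
import math


def repetition_array_bits(n: int, num_docs: int, c: float = 2.0) -> tuple[float, float]:
    """Return (old_bits, new_bits) for the repetition array.

    old_bits = n * log2(N)              -- bound of the earlier index
    new_bits = n * log2(log2(n)) + c*n  -- improved bound; c is a made-up
                                           stand-in for the O(n) constant
    """
    old_bits = n * math.log2(num_docs)
    new_bits = n * math.log2(math.log2(n)) + c * n
    return old_bits, new_bits


# Hypothetical collection: 2^30 symbols spread over 2^20 documents.
n, num_docs = 2**30, 2**20
old_bits, new_bits = repetition_array_bits(n, num_docs)
print(f"old bound: {old_bits / n:.2f} bits per symbol")
print(f"new bound: {new_bits / n:.2f} bits per symbol (with c = 2)")
```

The gap widens as the number of documents grows, because the old bound depends on log N while the new one depends only on log log n.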

Section: Methods (mentioning)

confidence: 99%

“…E.g. v_13 is mapped to [7,11]. The repetitions of interest are restricted to (−∞, depth(v_P) − 1] in the y range; so (−∞, 1] in our example.…”

Section: The Basic Framework and Data Structures (mentioning)

confidence: 99%

Elias-Fano meets Single-Term Top-k Document Retrieval

Labeit

Gog

2017

2017 Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX)

Self Cite

“…All programs were compiled with optimizations using g++ version 4.9.0. We are using test collections from the natural language domain: two Wikipedia dumps of different size, parsed as character and as word sequences [9], and a word parsing of the TREC GOV2 collection [10]. Table 1 summarizes their properties.…”

Section: Methods (mentioning)

confidence: 99%

“…Table 1 summarizes their properties. Our implementation and benchmarks are publicly available 3 as part of the SUccinct Retrieval Framework (SURF), which was introduced in [10]. The experiments can be easily reproduced by running the provided scripts.…”

Section: Methods (mentioning)

confidence: 99%

Improved Single-Term Top-k Document Retrieval

Gog

Navarro

2014

2015 Proceedings of the Seventeenth Workshop on Algorithm Engineering and Experiments (ALENEX)

Self Cite

On natural language text collections, finding the k documents most relevant to a query is generally solved with inverted indexes. On general string collections, however, more sophisticated data structures are necessary. Navarro and Nekrich [SODA 2012] showed that a linear-space index can solve such top-k queries in optimal time O(m + k), where m is the query length. Konow and Navarro [DCC 2013] implemented the scheme, managing to solve top-k queries within microseconds with an index using 3.3-4.0 bytes per character (this includes the storage of the collection itself). In this paper we introduce a new implementation using significantly less space, 2.5-3.0 bytes per character (again, including the collection), while retaining similar query times. For short queries, which are the most difficult, our new index actually outperforms the previous one, as well as all the other solutions in the literature. We also show that our index can be built on very large text collections, and that it can handle phrase queries efficiently on natural language text collections. In the latter case, it uses about the same space as the tokenized text (and replaces it), while answering phrase queries an order of magnitude faster than a positional inverted index.
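For orientation, the top-k problem this abstract refers to can be stated as a brute-force scan. The sketch below is a naive baseline, not the index described in the paper: it assumes term frequency (the number of occurrences of the pattern in a document) as the relevance measure, and all function names are hypothetical. It takes time proportional to the whole collection per query, which is exactly the cost a precomputed index is meant to avoid.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_top_k(docs: list[str], pattern: str, k: int) -> list[tuple[int, int]]:
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Ties are broken in favour of the smaller document id.
    Documents that do not contain the pattern are never reported.
    """
    scored = []
    for doc_id, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            scored.append((freq, doc_id))
    best = heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))
    return [(doc_id, freq) for freq, doc_id in best]


docs = ["abracadabra", "banana", "cabana", "bar"]
print(naive_top_k(docs, "a", 2))  # [(0, 5), (1, 3)]
```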

“…We also demonstrate the efficacy of our implementation for handling strings on large alphabets (with many millions of distinct symbols), which is important, e.g., for applications in natural language processing [21] and information retrieval [10]. Efficiency for large alphabets has been to date unaddressed by previous studies on EM suffix sorting [6,7,4,12,18,13], in all of which a byte alphabet is assumed.…”

Section: Introduction (mentioning)

confidence: 97%

Engineering External Memory Induced Suffix Sorting

Kärkkäinen

Kempa

Puglisi

et al. 2017

2017 Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX)

Suffix sorting, that is, determining the lexicographical order of all the suffixes of a string, is one of the most important problems in string processing. The resulting data structure is called the suffix array (SA) and underpins dozens of applications in bioinformatics, data compression, and information retrieval. When the size of the input string or the SA exceeds that of internal memory (RAM), an external memory (EM) suffix sorting algorithm must be used. The most scalable of these EM methods is due to Bingmann et al. (Proc. ALENEX 2013), and is essentially a careful disk-based implementation of the so-called induced sorting technique used by the fastest RAM suffix sorting algorithms. In this paper we show how to greatly improve the efficiency of induced suffix sorting in external memory via a non-trivial reorganization of the computation involved. Our experiments show this new approach to be twice as fast as state-of-the-art methods, while, just as significantly, using a third of the disk memory. We also demonstrate the efficacy of our implementation for handling strings on large alphabets (with many millions of distinct symbols), which is important, e.g., for applications in natural language processing and information retrieval, but unaddressed by previous EM suffix sorting implementations. Our implementation uses a (EM) radix heap data structure and, as a side result of independent interest, we introduce a new operation for radix heaps and other monotone priority queues called min-comp, which we believe to be useful for many other applications, including discrete event simulation and sweep line algorithms, even in internal memory.
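The definition of the suffix array given in this abstract can be illustrated with a direct, in-memory construction. The sketch below simply sorts suffix start positions by comparing the suffixes themselves; it is not induced sorting and not an external memory algorithm, and the function name is a hypothetical choice. It accepts any sequence of comparable symbols, so the same code covers a byte-like character alphabet and a large word alphabet.

```python
from collections.abc import Sequence


def naive_suffix_array(seq: Sequence) -> list[int]:
    """Return the start positions of all suffixes of seq in lexicographic order.

    Each comparison may inspect a whole suffix, so this is only suitable
    for small inputs that fit comfortably in RAM.
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


# Character alphabet.
print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]

# Word alphabet: each token is one symbol.
words = "to be or not to be".split()
print(naive_suffix_array(words))  # [5, 1, 3, 2, 4, 0]
```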
