Efficient in-memory top-k document retrieval

Culpepper, J. Shane; Petri, Matthias; Scholer, Falk

doi:10.1145/2348283.2348317

Cited by 27 publications

(17 citation statements)

References 42 publications

(60 reference statements)

“…Our compact framework is based on encoding these pointers in smaller amount of bits, while the compressed framework further samples these pointers as they pass through some specially chosen nodes. These frameworks are fairly general and have also been shown to be practical [Patil et al 2011; Culpepper et al 2012; Belazzougui et al 2013]. Even though efficient solutions are already available for the central problem, there are still many interesting variations and open questions one could ask about.…”

Section: Results (mentioning)

confidence: 99%

“…Navarro and Nekrich [2012b] gave an index of size O(n(log σ + log D)) bits index with optimal O(p + k) time; however, the hidden constants within the big-O notations are not small in practice [Konow and Navarro 2013]. It has been shown that, compact space indexes provide the best practical performance [Konow and Navarro 2013; Culpepper et al 2010] compared to linear space indexes [Patil et al 2011] (which are less efficient in terms of space occupancy) and the succinct space indexes [Culpepper et al 2012; (which are less efficient in terms of query processing time). See also Hsu and Ottaviano [2013] for a related result on top-k completion.…”

Section: Postscript (mentioning)

confidence: 99%

Space-Efficient Frameworks for Top-k String Retrieval

Hon, Shah, Thankachan et al. 2014

J. ACM

The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents where this word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an arbitrary string (which can be a partial word, a multiword phrase, or more generally any sequence of characters), then word boundaries are no longer relevant and we need a different approach. In string retrieval settings, we are given a set D = {d_1, d_2, d_3, ..., d_D} of D strings with n characters in total taken from an alphabet set Σ = [σ], and the task of the search engine, for a given query pattern P of length p, is to report the "most relevant" strings in D containing P. The query may also consist of two or more patterns. The notion of relevance can be captured by a function score(P, d_r), which indicates how relevant document d_r is to the pattern P. Some example score functions are the frequency of pattern occurrences, proximity between pattern occurrences, or pattern-independent PageRank of the document. The first formal framework to study such kinds of retrieval problems was given by Muthukrishnan [SODA 2002]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures that use O(n log n) words of space. We study this problem in a somewhat more natural top-k framework. Here, k is a part of the query, and the top k most relevant (highest-scoring) documents are to be reported in sorted order of score. We present the first linear-space framework (i.e., using O(n) words of space) that is capable of handling arbitrary score functions with near-optimal O(p + k log k) query time. The query time can be made optimal O(p + k) if sorted order is not necessary.
Further, we derive compact space and succinct space indexes (for some specific score functions). This space compression comes at the cost of higher query time. Finally, we extend our framework to handle the case of multiple patterns. Apart from providing a robust framework, our results also improve many earlier results in index space or query time or both.
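
As a point of reference for the problem statement above, the following is a minimal brute-force sketch of top-k retrieval with term frequency as score(P, d_r). It scans every document at query time, so it has none of the space or time guarantees of the indexes described in the abstract. The function names `count_occurrences` and `top_k_by_frequency` are illustrative assumptions, not part of the cited work.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_frequency(docs, pattern, k):
    """Return up to k (doc_id, frequency) pairs, highest frequency first.

    Ties are broken in favour of the smaller document id.
    Documents that do not contain the pattern are never reported.
    """
    scored = (
        (count_occurrences(doc, pattern), -doc_id)
        for doc_id, doc in enumerate(docs)
    )
    best = heapq.nlargest(k, (s for s in scored if s[0] > 0))
    return [(-neg_id, freq) for freq, neg_id in best]


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "cabana", "xyz"]
    print(top_k_by_frequency(docs, "ana", 2))
```

With the sample collection, the pattern "ana" occurs twice in "banana" and once in "cabana", so the two reported pairs are (1, 2) and (2, 1).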

“…Their index, on the other hand, turns out to be very competitive in practice. There exist several other indexes of practical interest [22,6,23,24,25].…”

Section: Source (mentioning)

confidence: 99%

New space/time tradeoffs for top-k document retrieval on sequences

Navarro, Thankachan 2014

Theoretical Computer Science

We address the problem of indexing a collection D = {T_1, T_2, ..., T_D} of D string documents of total length n, so that we can efficiently answer top-k queries: retrieve k documents most relevant to a pattern P of length p given at query time. There exist linear-space data structures, that is, using O(n) words, that answer such queries in optimal O(p + k) time for an ample set of notions of relevance. However, using linear space is not sufficiently good for large text collections. In this paper we explore how far the space/time tradeoff for this problem can be pushed. We obtain three results: (1) When relevance is measured as term frequency (number of times P appears in a document T_i), an index occupying |CSA| + o(n) bits answers the query in time O(t_search(p) + k lg^2 k lg^ε n), where CSA is a compressed suffix array indexing D, t_search is its time to find the suffix array interval of P, and ε > 0 is any constant. (2) With the same measure of relevance, an index occupying |CSA| + n lg D + o(n lg σ + n lg D) bits answers the query in time O(t_search(p) + k lg* k), where lg* k is the iterated logarithm of k. (3) When the relevance depends only on the documents, an index occupying |CSA| + O(n lg lg n) bits answers the query in O(t_search(p) + k t_SA) time, where t_SA is the time the CSA needs to retrieve a suffix array cell. On our way, we obtain some other results of independent interest.
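
The role of the suffix array interval of P in the term-frequency results can be illustrated with a toy, uncompressed sketch. It builds a plain suffix array over the concatenated collection, binary-searches the interval of suffixes that start with P, and counts how many of those suffixes fall in each document. This is only an assumption-laden illustration: the construction is quadratic, nothing is compressed, and the names `build_index`, `sa_interval`, `top_k_term_frequency`, and the separator `SEP` are hypothetical rather than taken from the paper.

```python
from collections import Counter
import heapq

SEP = "\x00"  # assumed separator that occurs in no document and no pattern


def build_index(docs):
    """Build (text, suffix array, position -> document id) for the collection."""
    text = "".join(doc + SEP for doc in docs)
    doc_of = []
    for doc_id, doc in enumerate(docs):
        doc_of.extend([doc_id] * (len(doc) + 1))
    # Toy construction: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, doc_of


def sa_interval(text, sa, pattern):
    """Return the half-open range [lo, hi) of suffixes that start with pattern."""
    p = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-p prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose length-p prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def top_k_term_frequency(index, pattern, k):
    """Return up to k (doc_id, frequency) pairs, highest frequency first."""
    if not pattern:
        return []
    text, sa, doc_of = index
    lo, hi = sa_interval(text, sa, pattern)
    freq = Counter(doc_of[sa[i]] for i in range(lo, hi))
    return heapq.nlargest(k, freq.items(), key=lambda item: (item[1], -item[0]))


if __name__ == "__main__":
    index = build_index(["abracadabra", "banana", "cabana"])
    print(top_k_term_frequency(index, "ana", 2))
```

The sketch walks the whole interval, so its query cost grows with the number of occurrences of P; the indexes in the abstract are designed precisely to avoid that dependence.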

“…We compare their best performing variant, GREEDY, in this paper. Culpepper et al [5] adapted the scheme to large natural language text collections (where each word is taken as an atomic symbol), showing that it was competitive with inverted indexes for some queries (see previous work on this line by Patil et al [20]). The seminal work of Hon et al [13] also included succinct variants, which were implemented by Navarro and Valenzuela [19] on top of a compressed representation of D.…”

Section: Basic Concepts (mentioning)

confidence: 99%

Improved Single-Term Top-k Document Retrieval

Gog, Navarro 2014

2015 Proceedings of the Seventeenth Workshop on Algorithm Engineering and Experiments (ALENEX)

On natural language text collections, finding the k documents most relevant to a query is generally solved with inverted indexes. On general string collections, however, more sophisticated data structures are necessary. Navarro and Nekrich [SODA 2012] showed that a linear-space index can solve such top-k queries in optimal time O(m + k), where m is the query length. Konow and Navarro [DCC 2013] implemented the scheme, managing to solve top-k queries within microseconds with an index using 3.3-4.0 bytes per character (this includes the storage of the collection itself). In this paper we introduce a new implementation using significantly less space, 2.5-3.0 bytes per character (again, including the collection), and retaining similar query times. For short queries, which are the most difficult, our new index actually outperforms the previous one, as well as all the other solutions in the literature. We also show that our index can be built on very large text collections, and that it can handle phrase queries efficiently on natural language text collections. In the latter case, it uses about the same space of the tokenized text (and replaces it), while answering phrase queries an order of magnitude faster than a positional inverted index.
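
For readers unfamiliar with the baseline named at the end of the abstract, here is a minimal sketch of a positional inverted index answering a phrase query. It assumes lowercase whitespace tokenization and in-memory dictionaries; the names `build_positional_index` and `phrase_frequency` are hypothetical, and the sketch says nothing about how the index in the paper itself is organised.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [token positions]} using whitespace tokenization."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in enumerate(docs):
        for pos, term in enumerate(doc.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_frequency(index, phrase):
    """Return {doc_id: number of occurrences of the phrase in that document}."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return {}
    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    result = {}
    for doc_id in candidates:
        later = [set(index[term][doc_id]) for term in terms[1:]]
        # A start position matches if term j appears exactly j positions later.
        hits = sum(
            all(start + offset + 1 in positions
                for offset, positions in enumerate(later))
            for start in index[terms[0]][doc_id]
        )
        if hits:
            result[doc_id] = hits
    return result


if __name__ == "__main__":
    docs = ["to be or not to be", "be not afraid", "not to be missed"]
    index = build_positional_index(docs)
    print(phrase_frequency(index, "to be"))
```

Each phrase query has to intersect the position lists of all its terms, which is the per-query work the abstract contrasts with its own index.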
