Space-Efficient Algorithms for Document Retrieval

Välimäki, Niko; Mäkinen, Veli

doi:10.1007/978-3-540-73437-6_22

Cited by 61 publications

(64 citation statements)

References 12 publications

“…We include three baseline methods derived from previous work on the document listing problem. The first two are implementations of Välimäki and Mäkinen [22] and Sadakane [20] as described in Section 3, labelled VM and Sada respectively. The third, ℓ-gram, is a close variant of Puglisi et al.'s inverted index of ℓ-grams [16], used with parameters ℓ = 3 and block size = 4096.…”

Section: Methods (mentioning)

confidence: 99%

Top-k Ranked Document Search in General Text Databases

Culpepper, Navarro, Puglisi, et al. 2010

Lecture Notes in Computer Science

Abstract. Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can readily be partitioned into words, allowing optimizations of data structures and algorithms for ranking. However, in many new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of words and traditional indexing approaches are not so easily adapted, or break down entirely. We present two new algorithms for ranking documents against a query without making any assumptions on the structure of the underlying text. We build on existing theoretical techniques, which we have implemented and compared empirically with new approaches introduced in this paper. Our best approach is significantly faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.

“…By representing D with a wavelet tree, values C[i] can be calculated on demand, rather than stored explicitly [22]. This reduces the space to |CSA| + n log N + 2n + o(n log N) bits, where |CSA| is the size of any compressed suffix array and N is the number of documents (Section 2).…”

Section: Document Listing (mentioning)

confidence: 99%
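
The on-demand computation of C[i] mentioned in the statement above can be illustrated with a small sketch. It assumes the usual definition from the document listing literature: D is the document array, and C[i] is the largest position j < i with D[j] = D[i], or -1 if there is none. The class name `DocArray` and the sorted position lists are hypothetical stand-ins for a wavelet tree; they only mimic its rank and select interface and are not space-efficient.

```python
from bisect import bisect_left


class DocArray:
    """Toy stand-in for a wavelet tree over the document array D.

    Only the rank/select interface is mimicked; a real wavelet tree
    would answer both queries without storing explicit position lists.
    """

    def __init__(self, D):
        self.D = list(D)
        self.pos = {}  # document id -> ascending list of its positions in D
        for i, d in enumerate(self.D):
            self.pos.setdefault(d, []).append(i)

    def rank(self, d, i):
        """Number of occurrences of d in D[0..i-1]."""
        return bisect_left(self.pos.get(d, []), i)

    def select(self, d, k):
        """Position of the k-th occurrence of d (k counted from 1)."""
        return self.pos[d][k - 1]

    def C(self, i):
        """Previous position holding the same document as D[i], or -1."""
        d = self.D[i]
        r = self.rank(d, i)  # occurrences of d strictly before position i
        return self.select(d, r) if r > 0 else -1


da = DocArray([0, 1, 0, 2, 1, 0])
print([da.C(i) for i in range(6)])
```

Because each C[i] is derived from one rank and one select query, the array C itself never needs to be stored.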

“…He used |CSA| + 4n + o(n) additional bits for data structures to compute the pattern's frequency in each document, increasing the time bound to O(search(m) + ndoc(lookup(n) + log log ndoc)) (assuming lookup(n) is also the time to find CSA⁻¹[ ], where CSA⁻¹ is the inverse permutation). Välimäki and Mäkinen [37] gave an alternative slower-but-smaller version of Muthukrishnan's CRL data structure, in which they used a 2n + o(n)-bit, O(1)-time RMQ succinct index due to Fischer and Heun [13] that requires access to C. Välimäki and Mäkinen showed how access to C can be implemented by rank and select queries on S; specifically, for 1 ≤ ≤ n,…”

Section: Listing (mentioning)

confidence: 99%
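
The role of the RMQ index over C in the statement above can be sketched as follows. This is a minimal illustration of the recursive listing idea usually attributed to Muthukrishnan, under stated assumptions: C is stored explicitly, positions are 0-based, and a linear scan stands in for the O(1)-time succinct RMQ. The function names `build_C` and `list_documents` are hypothetical.

```python
def build_C(D):
    """C[i] = previous position holding the same document as D[i], or -1."""
    last = {}
    C = []
    for i, d in enumerate(D):
        C.append(last.get(d, -1))
        last[d] = i
    return C


def list_documents(D, C, l, r):
    """Report each distinct document in D[l..r] exactly once."""
    out = []
    stack = [(l, r)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Naive range-minimum query; a succinct RMQ index would replace this scan.
        i = min(range(lo, hi + 1), key=C.__getitem__)
        if C[i] >= l:
            # Every document in D[lo..hi] already occurs earlier inside [l, r].
            continue
        out.append(D[i])  # leftmost occurrence of D[i] within [l, r]
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return out


D = [0, 1, 0, 2, 1, 0]
C = build_C(D)
print(sorted(list_documents(D, C, 2, 5)))
```

A position i in [l, r] is the leftmost occurrence of its document exactly when C[i] < l, which is why the recursion can stop as soon as the range minimum is at least l.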

“…The space bound is the sum of the space bounds and the time bound per reported color is O(t_acc + t_enum + t_rank), the latter term for computing frequencies. For example, 2+9: is Välimäki and Mäkinen's scheme [37]. 1: is the scheme by Gagie, Puglisi, and Turpin [15].…”

Section: Listing (mentioning)

confidence: 99%

“…In this paper, motivated by problems in document retrieval, we consider the latter three kinds of problems, which are often referred to as "colored" range queries: colored range listing (with or without color frequencies), colored range top-k queries, and colored range counting. These have been associated, respectively, to very relevant document retrieval queries on general texts [31,35,37,20,15,12,9]: listing the documents where a pattern appears (possibly computing term frequencies), finding the most relevant documents to a query (under a tf × idf scheme, for example), and computing document frequencies. Such techniques have been shown to be competitive [9], even beating classical inverted indexes on natural-language texts.…”

Section: Introduction (mentioning)

confidence: 99%
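
Two of the query types named in the statement above, colored range counting and listing with frequencies, can be made concrete with a brute-force sketch. It assumes a color (document) array D with 0-based positions and uses the common observation that a position i in [l, r] holds the first occurrence of its color in the range exactly when its previous occurrence lies before l. The function names are hypothetical, and the linear scans only define the expected answers; they are not the logarithmic-time structures the cited work describes.

```python
from collections import Counter


def previous_occurrence(D):
    """C[i] = previous position holding the same color as D[i], or -1."""
    last = {}
    C = []
    for i, d in enumerate(D):
        C.append(last.get(d, -1))
        last[d] = i
    return C


def count_colors(C, l, r):
    """Colored range counting: number of distinct colors in D[l..r]."""
    return sum(1 for i in range(l, r + 1) if C[i] < l)


def list_with_frequencies(D, l, r):
    """Colored range listing with frequencies: color -> occurrences in D[l..r]."""
    return dict(Counter(D[l:r + 1]))


D = [0, 1, 0, 2, 1, 0]
C = previous_occurrence(D)
print(count_colors(C, 2, 5), list_with_frequencies(D, 2, 5))
```

In the document retrieval reading, the count corresponds to a document frequency and the per-color occurrences correspond to term frequencies.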

Colored Range Queries and Document Retrieval

Gagie, Navarro, Puglisi 2010

String Processing and Information Retrieval

Abstract. Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper we give improved time and space bounds for three important one-dimensional colored range queries (colored range listing, colored range top-k queries and colored range counting) and, thus, new bounds for various document retrieval problems on general collections of sequences. Specifically, we first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the fastest compressed data structures for colored range listing and document listing, and an efficient data structure for document listing whose size is bounded in terms of the high-order entropies of the library of documents. We then show how (approximate) colored top-k queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how a modified wavelet tree can support colored range counting in logarithmic time and space that is succinct whenever the number of colors is superpolylogarithmic in the length of the sequence.
