DOI: 10.1007/978-3-540-73437-6_22

Space-Efficient Algorithms for Document Retrieval

Abstract: We study the Document Listing problem, where a collection D of documents d_1, …, d_k of total length ∑_i d_i = n is to be preprocessed, so that one can later efficiently list all the ndoc documents containing a given query pattern P of length m as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem; with O(n) time preprocessing, one can answer the queries in O(m + ndoc) time. In this paper, we improve the space requirement of Muthukrishnan's solution from O(n log n) bi…

citations
Cited by 61 publications
(64 citation statements)
references
References 12 publications
“…We include three baseline methods derived from previous work on the document listing problem. The first two are implementations of Välimäki and Mäkinen [22] and Sadakane [20] as described in Section 3, labelled VM and Sada respectively. The third, ℓ-gram, is a close variant of Puglisi et al's inverted index of ℓ-grams [16], used with parameters ℓ = 3 and block size= 4096.…”
Section: Methods (mentioning)
confidence: 99%
“…By representing D with a wavelet tree, values C[i] can be calculated on demand, rather than stored explicitly [22]. This reduces the space to |CSA| + n log N + 2n + o(n log N) bits, where |CSA| is the size of any compressed suffix array and N is the number of documents (Section 2).…”
Section: Document Listing (mentioning)
confidence: 99%
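
The statement above says that C[i] can be computed from D on demand. As a minimal sketch of that idea, assuming the usual definitions (D[i] is the document containing the i-th suffix in suffix-array order, and C[i] is the previous position holding the same document, or -1 if there is none), the following Python uses naive linear-time `rank` and `select` as stand-ins for the wavelet-tree operations. The function names and the 0-based indexing are illustrative choices, not taken from the cited papers.

```python
def rank(D, c, i):
    """Number of occurrences of c in D[0:i], i.e. strictly before position i."""
    return sum(1 for x in D[:i] if x == c)


def select(D, c, j):
    """Position of the j-th occurrence (1-based) of c in D, or -1 if none."""
    if j <= 0:
        return -1
    seen = 0
    for pos, x in enumerate(D):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return -1


def c_on_demand(D, i):
    """C[i] without storing C: the last occurrence of D[i] before position i."""
    d = D[i]
    return select(D, d, rank(D, d, i))


def c_explicit(D):
    """Reference version: build the whole C array in one left-to-right pass."""
    last = {}
    C = []
    for i, d in enumerate(D):
        C.append(last.get(d, -1))
        last[d] = i
    return C


D = [0, 1, 0, 2, 1, 0]  # toy document array (an assumption for illustration)
on_demand = [c_on_demand(D, i) for i in range(len(D))]
```

A wavelet tree over D answers both `rank` and `select` in O(log N) time, which is what lets the explicit C array be dropped; the scans here only illustrate the identity.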
“…He used |CSA| + 4n + o(n) additional bits for data structures to compute the pattern's frequency in each document, increasing the time bound to O(search(m) + ndoc(lookup(n) + log log ndoc)) (assuming lookup(n) is also the time to find CSA⁻¹[ ], where CSA⁻¹ is the inverse permutation). Välimäki and Mäkinen [37] gave an alternative slower-but-smaller version of Muthukrishnan's CRL data structure, in which they used a 2n + o(n) bit, O(1) time RMQ succinct index due to Fischer and Heun [13] that requires access to C. Välimäki and Mäkinen showed how access to C can be implemented by rank and select queries on S; specifically, for 1 ≤ ≤ n,…”
Section: Listing (mentioning)
confidence: 99%
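
To make the role of the range-minimum query over C concrete, here is a hedged sketch of the recursive listing step commonly attributed to Muthukrishnan. It assumes D is the document array, C[i] is the previous position of document D[i] (or -1), and [lo, hi] is the suffix-array range of the pattern. The RMQ is a naive scan rather than the succinct Fischer and Heun index, so the sketch shows the logic only, not the stated time or space bounds.

```python
def build_c(D):
    """C[i] = previous position holding the same document as D[i], or -1."""
    last = {}
    C = []
    for i, d in enumerate(D):
        C.append(last.get(d, -1))
        last[d] = i
    return C


def list_documents(D, C, lo, hi):
    """Report each distinct document in D[lo..hi] (inclusive) exactly once."""
    out = []
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if a > b:
            continue
        # Naive range-minimum query: position of the smallest C value in [a, b].
        p = min(range(a, b + 1), key=lambda j: C[j])
        if C[p] >= lo:
            # Every position in [a, b] repeats a document seen earlier in [lo, hi].
            continue
        # C[p] < lo means p is the first occurrence of D[p] inside [lo, hi].
        out.append(D[p])
        stack.append((a, p - 1))
        stack.append((p + 1, b))
    return out


D = [0, 1, 0, 2, 1, 0]  # toy document array (an assumption for illustration)
C = build_c(D)
docs = sorted(list_documents(D, C, 2, 5))
```

With a constant-time RMQ, each reported document costs O(1) RMQ calls plus the cost of accessing C and D, which is where the per-document terms in the bounds quoted above come from.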
“…The space bound is the sum of the space bounds and the time bound per reported color is O(t_acc + t_enum + t_rank), the latter term for computing frequencies. For example, 2+9: is Välimäki and Mäkinen's scheme [37]. 1: is the scheme by Gagie, Puglisi, and Turpin [15].…”
Section: Listing (mentioning)
confidence: 99%