Space-Efficient Frameworks for Top-k String Retrieval

Hon, Wing-Kai; Shah, Rahul; Thankachan, Sharma V.; Vitter, Jeffrey Scott

doi:10.1145/2590774

Cited by 29 publications

(3 citation statements)

References 67 publications


“…It is also worth mentioning that the journal version of the original paper of Hon et al. has recently appeared as well [40]. Here they show how to obtain O(p + k) time if the top-k results are not to be returned sorted by relevance.…”

Section: Discussion (mentioning)

confidence: 90%

Time-Optimal Top-$k$ Document Retrieval

Navarro, Nekrich

2017

SIAM J. Comput.


Let 𝒟 be a collection of D documents, which are strings over an alphabet of size σ, of total length n. We describe a data structure that uses linear space and reports k most relevant documents that contain a query pattern P, which is a string of length p packed in p/log_σ n words, in time O(p/log_σ n + k). This is optimal in the RAM model in the general case where log D = Θ(log n), and involves a novel RAM-optimal suffix tree search. Our construction supports an ample set of important relevance measures, such as the number of times P appears in a document (called term frequency), a fixed document importance, and the minimal distance between two occurrences of P in a document. When log D = o(log n), we show how to reduce the space of the data structure from O(n log n) to O(n(log σ + log D + log log n)) bits, and to O(n(log σ + log D)) bits in the case of the popular term frequency measure of relevance, at the price of an additive term O(log^ε_σ n) in the query time, for any constant ε > 0. We also consider the dynamic scenario, where documents can be inserted and deleted from the collection. We obtain linear space and query time O(p(log log n)^2/log_σ n + log n + k log log k), whereas insertions and deletions require O(log^{1+ε} n) time per symbol, for any constant ε > 0. Finally, we consider an extended static scenario where an extra parameter par(P, d) is defined, and the query must retrieve only documents d such that par(P, d) ∈ [τ_1, τ_2], where this range is specified at query time. We solve these queries using linear space and O(p/log_σ n + log^{1+ε} n + k log^ε n) time, for any constant ε > 0. Our technique is to translate these top-k problems into multidimensional geometric search problems. As a bonus, we describe some improvements to those problems.
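To make the query in this abstract concrete, the following is a minimal brute-force sketch of top-k retrieval under the term frequency measure. It is not the data structure the abstract describes: it scans every document on each query and so comes nowhere near the O(p/log_σ n + k) bound. The function name `top_k_by_term_frequency`, the tie-breaking rule, and the sample documents are assumptions made for illustration only.

```python
import heapq


def top_k_by_term_frequency(documents, pattern, k):
    """Return up to k (doc_id, frequency) pairs, highest frequency first.

    Naive baseline: scans every document, so it is far slower than an
    index-based solution. Ties are broken by smaller doc_id (an assumption).
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    scored = []
    for doc_id, text in enumerate(documents):
        # Count occurrences of the pattern, including overlapping ones.
        freq = sum(
            1
            for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)
        )
        if freq > 0:  # only documents that contain the pattern qualify
            scored.append((freq, doc_id))
    # Select the k best entries: higher frequency first, then smaller id.
    best = heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))
    return [(doc_id, freq) for freq, doc_id in best]


# Hypothetical sample collection.
docs = ["banana", "ananas", "bandana", "cherry"]
print(top_k_by_term_frequency(docs, "ana", 2))
```

Documents that do not contain the pattern are never reported, so fewer than k results may be returned.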



“…For example, categorical range counting queries (i.e., count the number of different values in a range) requires in general Ω(log n / log log n) time if using O(n polylog n) space [11], where n is the array size, but if queries form a hierarchy it is easily solved in constant time and O(n) bits [13]. A second example is the range mode problem (i.e., find the most frequent value in a range), which is believed to require time Ω(n^{1.188}) if using O(n^{1.188}) space [4], but if queries form a hierarchy it is easily solved in constant time and linear space [8].…”

Section: Introduction (mentioning)

confidence: 99%

“…In this paper we aim at a compact data structure to represent data cubes where the domains in each dimension are hierarchical. Following the general idea of the tailored solutions to the problems we mentioned [13,8], our structure partitions the space according to the hierarchies, instead of performing a regular partition like generic multidimensional structures. Therefore, the queries of interest for OLAP applications, which combine nodes of the different hierarchies, will require aggregating the information of just a few nodes in our partitions, much fewer than if we used a generic space partitioning method.…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Representation of Multidimensional Data over Hierarchical Domains

Brisaboa, Cerdeira-Pena, López-López, et al.

2016

String Processing and Information Retrieval


Abstract. We consider the problem of representing multidimensional data where the domain of each dimension is organized hierarchically, and the queries require summary information at a different node in the hierarchy of each dimension. This is the typical case of OLAP databases. A basic approach is to represent each hierarchy as a one-dimensional line and recast the queries as multidimensional range queries. This approach can be implemented compactly by generalizing to more dimensions the k²-treap, a compact representation of two-dimensional points that allows for efficient summarization queries along generic ranges. Instead, we propose a more flexible generalization, which, rather than using a generic quadtree-like partition of the space, follows the domain hierarchies across each dimension to organize the partitioning. The resulting structure is much more efficient than a generic multidimensional structure, since queries are resolved by aggregating far fewer nodes of the tree.
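The kind of query this abstract refers to, a summary taken at one node of each dimension's hierarchy, can be illustrated with a small sketch. The code below is not the compact structure proposed in the paper: it simply materializes the sum for every pair of hierarchy nodes in a dictionary, which answers queries with one lookup but uses far more space. The two hierarchies, the cell values, and the names `ancestors` and `build_aggregates` are all hypothetical.

```python
from collections import defaultdict


def ancestors(node, parent):
    """Yield the node itself and then each ancestor up to the root."""
    while node is not None:
        yield node
        node = parent.get(node)  # the root has no entry, so this ends the loop


def build_aggregates(cells, parent_a, parent_b):
    """Precompute the sum of cell values for every pair of hierarchy nodes.

    cells maps (leaf_a, leaf_b) to a numeric value; parent_a and parent_b
    map each non-root node to its parent in the respective hierarchy.
    """
    totals = defaultdict(int)
    for (leaf_a, leaf_b), value in cells.items():
        for a in ancestors(leaf_a, parent_a):
            for b in ancestors(leaf_b, parent_b):
                totals[(a, b)] += value
    return totals


# Hypothetical time and place hierarchies for an OLAP-style example.
parent_time = {"Jan": "Q1", "Feb": "Q1", "Apr": "Q2", "Q1": "2016", "Q2": "2016"}
parent_place = {"Madrid": "Spain", "Vigo": "Spain", "Lyon": "France",
                "Spain": "Europe", "France": "Europe"}
cells = {("Jan", "Madrid"): 5, ("Feb", "Vigo"): 3,
         ("Apr", "Lyon"): 7, ("Jan", "Lyon"): 2}

totals = build_aggregates(cells, parent_time, parent_place)
print(totals[("Q1", "Spain")])  # sum over Q1 months and Spanish cities
```

Each query combines one node from each hierarchy, so it reduces to a single dictionary lookup here; the trade-off is that every ancestor pair is stored explicitly.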
