Top-k Query Evaluation with Probabilistic Guarantees

Theobald, Martin; Weikum, Gerhard; Schenkel, Ralf; Nascimento, Mário A.; Özsu, M. Tamer; Kossmann, Donald; Miller, Renée J.; Blakeley, José A.; Schiefer, K. Bernhard

doi:10.1016/b978-012088469-8/50058-9

Cited by 80 publications (50 citation statements)

References 13 publications (17 reference statements)


“…Ayanso et al [3] analyze the common histogram construction techniques and their impact on top-k retrieval. Theobald et al [21] propose a method for probabilistic top-k queries by predicting the total score of a candidate item. In some cases, random access is limited or unavailable, NRA [12] is proposed with sequential access only.…”

Section: Related Work (mentioning)

confidence: 99%

Supporting early pruning in top-k query processing on massive data

Han

Yang

2011

Information Processing Letters


“…The algorithm terminates when the candidate queue is empty (and a virtual document that has not yet been seen in any index list and has a bestscore ≤ Σ_{i=1...m} high(i) can not qualify for the top-k either). For approximating a top-k result with low error probability [52], the conservative bestscores, with high(i) values assumed for unknown scores, can be substituted by quantiles of the score distribution in the unvisited tails of the index lists. Technically, this amounts to estimating the convolution of the unknown scores of a candidate.…”

Section: Related Work (mentioning)

confidence: 99%
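
The pruning idea quoted above can be illustrated with a minimal sketch. It is not the cited authors' implementation: it assumes summation as the aggregation function, independent scores across index lists, and tail distributions given as small histograms over integer score units. All function and parameter names (`convolve`, `exceed_probability`, `can_prune`, `min_k`, `epsilon`) are hypothetical.

```python
def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q.

    p[s] is the probability that a score equals s integer units.
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def conservative_bestscore(worstscore, highs):
    """Upper bound: every unknown score is assumed to equal high(i)."""
    return worstscore + sum(highs)


def exceed_probability(tails, gap):
    """P[sum of the unknown scores > gap], via convolution of the tails."""
    dist = [1.0]  # point mass at 0: no unknown score added yet
    for tail in tails:
        dist = convolve(dist, tail)
    return sum(prob for score, prob in enumerate(dist) if score > gap)


def can_prune(worstscore, tails, min_k, epsilon):
    """Drop a candidate if it beats the current k-th score with probability <= epsilon."""
    return exceed_probability(tails, min_k - worstscore) <= epsilon


# Assumed toy data: a candidate with worstscore 5, unseen in two lists whose
# unvisited tails are uniform over scores {0, 1, 2}; the current k-th score is 8.
tails = [[1 / 3] * 3, [1 / 3] * 3]
print(conservative_bestscore(5, [2, 2]))  # 9, so the conservative test keeps it
print(exceed_probability(tails, 8 - 5))   # chance of actually exceeding 8
```

In this toy case the conservative bound cannot rule the candidate out, while the convolution gives a probability of 1/9 that it exceeds the threshold, so `can_prune` drops it for any `epsilon` of at least about 0.12. The price of the earlier termination is a bounded chance of missing a true top-k item.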

“…Top-k query processing has received much attention in a variety of settings such as similarity search on multimedia data [7,24,29,30,45,46], ranked retrieval on text and semistructured documents in digital libraries and on the Web [3,6,36,40,48,52,55], network and stream monitoring [4,14], collaborative recommendation and preference queries on e-commerce product catalogs [17,31,42,56], and ranking of SQL-style query results on structured data sources in general [1,11,18]. Among the ample work on top-k query processing, the TA family of algorithms for monotonic score aggregation [25,30,46] stands out as an extremely efficient and highly versatile method.…”

Section: Related Work (mentioning)

confidence: 99%

“…Therefore standard cardinality estimators are not enough, the query optimizer needs to estimate score distributions. Prior work on this issue either used fairly crude models like assuming Normal distributions [35], which is not a good fit for real-data scores, or required extensive computations like sampling or histogram maintenance [52] that may incur high costs in a distributed setting with a high-latency network.…”

Section: Distributed Statistics and Cost Model (mentioning)

confidence: 99%


Algebraic query optimization for distributed top-k queries

Neumann

Michel

2007

Informatik Forsch. Entw.


Distributed top-k query processing is increasingly becoming an essential functionality in a large number of emerging application classes. This paper addresses the efficient algebraic optimization of top-k queries in wide-area distributed data repositories where the index lists for the attribute values (or text terms) of a query are distributed across a number of data peers and the computational costs include network latency, bandwidth consumption, and local peer work. We use a dynamic programming approach to find the optimal execution plan using compact data synopses for selectivity estimation, which is the basis for our cost model. The optimized query is executed in a hierarchical way involving a small and fixed number of communication phases. We have performed experiments on real web data that show the benefits of distributed top-k query optimization both in network resource consumption and query response time.


“…However, our preliminary work is limited to rank formulation for numerical data, while this paper reports our extension, which enables both processing and formulation of the combination of numerical and categorical data. As supporting structures for ranked retrieval, one-dimensional (e.g., sorted access [8,7,3,4,11] or inverted index [1,18,19,5]) or multi-dimensional numerical indices (e.g., R-tree [17]) have been considered. In particular, our work is closely related to [5] indexing the relevance score of each possible value by applying Bayes' Rule on prior query workload and [17] indexing multi-dimensional objects by the similarity score using an R-tree index.…”

Section: Related Work (mentioning)

confidence: 99%