Evaluation strategies for top-k queries over memory-resident inverted indexes

Fontoura, Marcus; Josifovski, Vanja; Liu, Jinhui; Venkatesan, S.; Zhu, Xiangfei; Zien, Jason Y.

doi:10.14778/3402755.3402756

Cited by 56 publications

(8 citation statements)

References 21 publications


“…These approaches tend to be more efficient than are DaaT MaxScore and term-at-a-time (TaaT) approaches [67,68], particularly for short queries, the most common scenario in web search. However, for long queries or large candidate sets, the case is less clear-cut [23,33,54]; moreover, fusion over query variations often leads to very long queries.…”

Section: Efficient Index Traversal (mentioning)

confidence: 99%

Boosting Search Performance Using Query Variations

Benham, Mackenzie, Moffat, et al. 2019

ACM Trans. Inf. Syst.


Rank fusion is a powerful technique that allows multiple sources of information to be combined into a single result set. However, to date fusion has not been regarded as being cost-effective in cases where strict per-query efficiency guarantees are required, such as in web search. In this work we propose a novel solution to rank fusion by splitting the computation into two parts: one phase that is carried out offline to generate pre-computed centroid answers for queries with broadly similar information needs, and then a second online phase that uses the corresponding topic centroid to compute a result page for each query. We explore efficiency improvements to classic fusion algorithms whose costs can be amortized as a pre-processing step, and can then be combined with re-ranking approaches to dramatically improve effectiveness in multi-stage retrieval systems with little efficiency overhead at query time. Experimental results using the ClueWeb12B collection and the UQV100 query variations demonstrate that centroid-based approaches allow improved retrieval effectiveness at little or no loss in query throughput or latency, and with reasonable pre-processing requirements. We additionally show that queries that do not match any of the pre-computed clusters can be accurately identified and efficiently processed in our proposed ranking pipeline. This work is currently under review.


“…However, if a few n_j corresponding to "head" labels are near N, which means almost all data points have the same label, the above cost reaches O(N^2). Efficient algorithms for top-k retrieval on an inverted index [11] or finding approximate k-nearest neighbors [12] can be applied to this situation. However, we focus on tail labels and ignore head labels in some cases.…”

Section: Learning To Partition Data Points (mentioning)

confidence: 99%

Speeding up Extreme Multi-Label Classifier by Approximate Nearest Neighbor Search

Tagami 2018

IEICE Trans. Inf. & Syst.


Extreme multi-label classification methods have been widely used in Web-scale classification tasks such as Web page tagging and product recommendation. In this paper, we present a novel graph embedding method called "AnnexML". At the training step, AnnexML constructs a k-nearest neighbor graph of label vectors and attempts to reproduce the graph structure in the embedding space. The prediction is efficiently performed by using an approximate nearest neighbor search method that efficiently explores the learned k-nearest neighbor graph in the embedding space. We conducted evaluations on several large-scale real-world data sets and compared our method with recent state-of-the-art methods. Experimental results show that our AnnexML can significantly improve prediction accuracy, especially on data sets that have a larger label space. In addition, AnnexML improves the trade-off between prediction time and accuracy. At the same level of accuracy, the prediction time of AnnexML was up to 58 times faster than that of SLEEC, a state-of-the-art embedding-based method.


“…The interest of casting these updates as variants of MIPS problems is to exploit the ideas developed in the literature for solving these problems efficiently. Teflioudi and Gemulla (2016) and Fontoura et al (2011) give good overviews of MIPS solvers developed for recommender systems and information retrieval applications respectively. In both cases, the proposed methods rely on two main ideas: (i) adequate indexing techniques or data structures and (ii) pruning criteria which allow to not compute all inner products entirely.…”

Section: Updating the Working Set (mentioning)

confidence: 99%

WHInter: A Working set algorithm for High-dimensional sparse second order Interaction models

Morvan, Vert 2018

Preprint


Learning sparse linear models with two-way interactions is desirable in many application domains such as genomics. ℓ1-regularised linear models are popular to estimate sparse models, yet standard implementations fail to address specifically the quadratic explosion of candidate two-way interactions in high dimensions, and typically do not scale to genetic data with hundreds of thousands of features. Here we present WHInter, a working set algorithm to solve large ℓ1-regularised problems with two-way interactions for binary design matrices. The novelty of WHInter stems from a new bound to efficiently identify working sets while avoiding to scan all features, and on fast computations inspired from solutions to the maximum inner product search problem. We apply WHInter to simulated and real genetic data and show that it is more scalable and two orders of magnitude faster than the state of the art.
