An experimental investigation of set intersection algorithms for text searching

Barbay, Jérémy; López-Ortíz, Alejandro; Lu, Tyler; Salinger, Alejandro

doi:10.1145/1498698.1564507

Cited by 50 publications

(61 citation statements)

References 11 publications

“…Experiments in [22], [23] compare several intersection algorithms and show that the complexity of intersections relies heavily on the distributions of the elements in the sets.…”

Section: Related Work (mentioning)

confidence: 99%

A Set Intersection Algorithm Via x-Fast Trie

Ye

2016

JCP

This paper proposes a simple intersection algorithm for two sorted integer sequences. Our algorithm is based on the x-fast trie, since it provides efficient find and successor operators. We show that our algorithm outperforms a skip-list-based algorithm when one of the sets to be intersected is relatively 'dense' while the other is relatively 'sparse'. Finally, we propose some possible approaches that may optimize our algorithm further.
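The sparse-versus-dense case mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the x-fast trie algorithm from the cited paper: as an assumption for illustration, the successor lookup is done with a plain binary search (`bisect_left`) over a sorted list, and the function name `intersect_sparse_dense` is hypothetical.

```python
from bisect import bisect_left


def intersect_sparse_dense(sparse, dense):
    """Intersect two sorted lists of distinct integers.

    For each element of the sparse list, find its successor in the dense
    list by binary search. This is a generic stand-in for the find and
    successor operators; it is NOT an x-fast trie.
    """
    result = []
    lo = 0  # everything in dense before this index is already ruled out
    for x in sparse:
        # Position of the smallest element >= x, searching only dense[lo:].
        lo = bisect_left(dense, x, lo)
        if lo == len(dense):
            break  # x and all later sparse elements exceed every dense element
        if dense[lo] == x:
            result.append(x)
            lo += 1
    return result


if __name__ == "__main__":
    dense = list(range(0, 100, 2))   # 50 even numbers
    sparse = [3, 50, 97]             # only 3 probes are needed
    print(intersect_sparse_dense(sparse, dense))
```

The number of searches equals the size of the sparse list, which is why probing tends to pay off when the two lists differ greatly in size.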

“…A typical way to solve a ranked intersection is to first compute a Boolean intersection, then compute the scores of all the resulting documents, and finally keep the documents with the k highest scores. This approach has triggered much research on the Boolean intersection problem [21,6,34,8,26]. This approach is, of course, suboptimal, since in principle one could use weight information to filter out documents that belong to the intersection but one can ensure will not make it to the top-k list.…”

Section: Basic Concepts (mentioning)

confidence: 99%
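The two-step approach described in the statement above (Boolean intersection, then scoring, then top-k selection) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of any cited paper: the function name `ranked_intersection`, the `score` callback, and the sample scores are all hypothetical.

```python
import heapq


def ranked_intersection(postings_a, postings_b, score, k):
    """Two-step ranked intersection over two sorted docID lists.

    `score` is an assumed callback mapping a docID to a number.
    """
    # Step 1: Boolean intersection by a linear merge of the sorted lists.
    i = j = 0
    common = []
    while i < len(postings_a) and j < len(postings_b):
        a, b = postings_a[i], postings_b[j]
        if a == b:
            common.append(a)
            i += 1
            j += 1
        elif a < b:
            i += 1
        else:
            j += 1
    # Steps 2 and 3: score every surviving document, keep the k best.
    return heapq.nlargest(k, common, key=score)


if __name__ == "__main__":
    # Made-up scores for illustration only.
    scores = {2: 0.5, 5: 0.9, 7: 0.1, 9: 0.7}
    a = [1, 2, 5, 7, 9]
    b = [2, 3, 5, 7, 9, 10]
    print(ranked_intersection(a, b, scores.get, 2))  # -> [5, 9]
```

Every document in the intersection is scored here, including those that cannot reach the top k, which is the inefficiency the statement points out.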

“…Traditionally, the posting lists were stored on disk. With the availability of large amounts of main memory, this trend has changed to use the main memory of a cluster of machines, and many intersection algorithms have been designed for random access [21,6,34,20,35,37,8,26]. In distributed main-memory systems, usually documents are distributed across independent inverted indexes, and each index contributes with a few results to the final top-k list.…”

Section: Basic Concepts (mentioning)

confidence: 99%

Faster and smaller inverted indices with treaps

Konow

Navarro

Clarke

et al. 2013

Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval

We introduce a new representation of the inverted index that performs faster ranked unions and intersections while using less space. Our index is based on the treap data structure, which allows us to intersect/merge the document identifiers while simultaneously thresholding by frequency, instead of the costlier two-step classical processing methods. To achieve compression we represent the treap topology using compact data structures. Further, the treap invariants allow us to elegantly encode differentially both document identifiers and frequencies. Results show that the space consumption is below 10% of the size of the corpus and the index performs queries up to twice as fast as previous compact representations, which in addition require more space. Modern two-stage (massive filtering / detailed ranking) information retrieval systems would benefit from this boosting of the filtration stage of the query resolution process, which would free more resources for the ranking stage, thus enabling more precise results within a given time budget.

“…Thus, in order to retrieve the query result we usually have to scan the entire lists. There has been extensive work on list intersection algorithms, that is applicable to inverted files [6,39,43]. The focus in these works lies in reducing the CPU cost, since they are mostly aimed at specialized systems, which answer few types of queries and can afford to have all lists in main memory.…”

Section: Query Evaluation (mentioning)

confidence: 99%

“…The former exploits parallelism between different queries, while the latter parallelizes the processing within a single query. Finally, [6] offers an experimental comparison of several popular methods of list intersection with respect to their CPU cost.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient answering of set containment queries for skewed item distributions

Terrovitis

Bouros

Vassiliadis

et al. 2011

Proceedings of the 14th International Conference on Extending Database Technology

In this paper we address the problem of efficiently evaluating containment (i.e., subset, equality, and superset) queries over set-valued data. We propose a novel indexing scheme, the Ordered Inverted File (OIF) which, differently from the state-of-the-art, indexes set-valued attributes in an ordered fashion. We introduce query processing algorithms that practically treat containment queries as range queries over the ordered postings lists of OIF and exploit this ordering to quickly prune unnecessary page accesses. OIF is simple to implement and our experiments on both real and synthetic data show that it greatly outperforms the current state-of-the-art methods for all three classes of containment queries.
