Indexing Shared Content in Information Retrieval Systems

Broder, Andrei Z.; Eiron, Nadav; Fontoura, Marcus; Herscovici, Michael; Lempel, Ronny; McPherson, John R.; Qi, Runping; Shekita, Eugene J.

doi:10.1007/11687238_21

Cited by 33 publications

(47 citation statements)

References 20 publications

“…Indexing versioned document collections has been studied in [7,25,14,13]. Broder et al [7] propose a technique that exploits large content overlaps between documents to achieve a reduction in index size.…”

Section: Indexing Versioned Document Collections (mentioning)

confidence: 99%

Durable top-k search in document archives

Mamoulis, Berberich, et al. 2010

Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data

We propose and study a new ranking problem in versioned databases. Consider a database of versioned objects which have different valid instances along a history (e.g., documents in a web archive). Durable top-k search finds the set of objects that are consistently in the top-k results of a query (e.g., a keyword query) throughout a given time interval (e.g., from June 2008 to May 2009). Existing work on temporal top-k queries mainly focuses on finding the most representative top-k elements within a time interval. Such methods are not readily applicable to durable top-k queries. To address this need, we propose two techniques that compute the durable top-k result. The first is adapted from the classic top-k rank aggregation algorithm NRA. The second technique is based on a shared execution paradigm and is more efficient than the first approach. In addition, we propose a special indexing technique for archived data. The index, coupled with a space partitioning technique, improves performance even further. We use data from Wikipedia and the Internet Archive to demonstrate the efficiency and effectiveness of our solutions.
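The definition of durable top-k search in the abstract above can be illustrated with a naive baseline: compute the top-k set at every time point in the interval and intersect the sets. This is only a sketch under stated assumptions, not the NRA-based or shared-execution techniques the paper proposes. The function names and the representation of the history as a list of per-snapshot score dictionaries are hypothetical.

```python
from typing import Dict, List, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the k highest-scoring object ids (ties broken by id)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return {obj for obj, _ in ranked[:k]}


def durable_top_k(snapshots: List[Dict[str, float]], k: int) -> Set[str]:
    """Objects that are in the top-k at every snapshot of the interval.

    Assumption: each snapshot maps object id -> query score at one time point.
    """
    if not snapshots:
        return set()
    durable = top_k(snapshots[0], k)
    for scores in snapshots[1:]:
        durable &= top_k(scores, k)
        if not durable:  # nothing can re-enter once the intersection is empty
            break
    return durable


# Three hypothetical snapshots of the same query over time.
snapshots = [
    {"a": 0.9, "b": 0.8, "c": 0.1},
    {"a": 0.7, "b": 0.2, "c": 0.6},
    {"a": 0.8, "b": 0.5, "c": 0.4},
]
print(durable_top_k(snapshots, k=2))
```

The baseline scores every object at every time point, which is the cost that the techniques described in the abstract aim to avoid.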

“…Broder et al [7] propose a technique that exploits large content overlaps between documents to achieve a reduction in index size. Each version is partitioned into a set of fragments, e.g., an email is partitioned into two fragments, subject and body.…”

Section: Indexing Versioned Document Collections (mentioning)

confidence: 99%

Durable top-k search in document archives

Mamoulis, Berberich, et al. 2010

Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data

“…The so-called dictionary of the inverted index can be organized in different ways in order to meet the required types of queries and specification of data. In our research we will be focused on search trees [2,4,5,7].…”

Section: Introduction (mentioning)

confidence: 99%

Multiset Representation of Objects in Information Retrieval Systems

Akulich, Krnc, Savnik, et al. 2017

StuCoSReC. Proceedings of the 2017 4th Student Computer Science Research Conference.

In this paper we present the multiset-trie, a novel data structure that operates on objects represented as multisets. The multiset-trie is a search-tree-based data structure with properties similar to those of a trie. In particular, we efficiently implement the standard search tree operations together with the special set containment operations, i.e., subset and superset queries in the context of multisets. These are called submultiset and supermultiset, respectively, and are used to implement various queries that can be performed on multisets in a multiset-trie. The running times of the developed functions are analyzed mathematically and experimentally. One of the most important queries is the nearest-neighbor search for a given input object. The nearest-neighbor search of a multiset-trie makes it a good alternative to the index data structures used in information retrieval systems. In particular, our research is focused on the application of the multiset-trie to full-text search systems.
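The submultiset and supermultiset queries mentioned in the abstract can be made concrete with a small sketch. It shows only the containment predicate and a linear scan over a stored collection, using Python's `collections.Counter` as the multiset representation; it does not implement the multiset-trie itself. The function names and the example words are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def is_submultiset(small: Counter, large: Counter) -> bool:
    """True if every element of `small` occurs at least as often in `large`."""
    return all(large[item] >= count for item, count in small.items())


def supermultisets_of(query: Counter, stored: Iterable[Counter]) -> List[Counter]:
    """Linear-scan supermultiset query: stored multisets that contain `query`."""
    return [candidate for candidate in stored if is_submultiset(query, candidate)]


# Words treated as multisets of letters (an illustrative assumption).
stored = [Counter("teeth"), Counter("tent"), Counter("ten")]
query = Counter("tet")  # needs two 't' and one 'e'

matches = supermultisets_of(query, stored)
print([sorted(m.elements()) for m in matches])
```

A scan like this checks every stored multiset; an index structure such as the one described in the abstract exists to avoid that.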

“…Both inverted indexes for word and phrase queries over natural language texts [2,5,11,12], and other indexes for general string collections [16,6,14,7], have been pursued.…”

Section: Introduction (mentioning)

confidence: 99%

Indexes for highly repetitive document collections

Claude, Fariña, Martínez‐Prieto, et al. 2011

Proceedings of the 20th ACM International Conference on Information and Knowledge Management

We introduce new compressed inverted indexes for highly repetitive document collections. They are based on run-length, Lempel-Ziv, or grammar-based compression of the differential inverted lists, instead of gap-encoding them as is the usual practice. We show that our compression methods significantly reduce the space achieved by classical compression, at the price of moderate slowdowns. Moreover, many of our methods are universal, that is, they do not need to know the versioning structure of the collection. We also introduce compressed self-indexes in the comparison. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes, yet they are orders of magnitude slower.
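To see why run-length compression of differential (gap) lists can help on repetitive collections, consider a term that occurs in many consecutive versions of a document: its posting list contains long runs of gap 1. The following is a minimal sketch of that idea only, assuming document ids are positive and strictly increasing. It is not the encoding used in the cited paper, and the function names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def to_gaps(postings: List[int]) -> List[int]:
    """Differential list: each doc id minus the previous one (first minus 0)."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def run_length_encode(gaps: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal gaps into (gap, repeat_count) pairs."""
    return [(gap, len(list(run))) for gap, run in groupby(gaps)]


def decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Rebuild the original posting list from the run-length encoded gaps."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# A term present in versions 3..7 and 20..22 of a hypothetical collection.
postings = [3, 4, 5, 6, 7, 20, 21, 22]
gaps = to_gaps(postings)
runs = run_length_encode(gaps)
print(gaps)
print(runs)
```

Here eight gaps collapse into four runs; the longer the stretches of consecutive versions that share a term, the larger the saving.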
