Boosting static pruning of inverted files

Abstract. An element-index is a crucial mechanism for supporting content-only (CO) queries over XML collections. A full element-index that indexes each element along with the content of its descendants involves high redundancy and reduces query processing efficiency. A direct index, on the other hand, indexes only the content that is directly under each element and disregards the descendants. This results in a smaller index, but possibly at the cost of some reduction in system effectiveness. In this paper, we propose using static index pruning techniques to obtain more compact index files that can still yield retrieval performance comparable to that of a full index. We also compare the retrieval performance of these pruning-based approaches to some other strategies that make use of a direct element-index. Our experiments, conducted along the lines of the INEX evaluation framework, reveal that pruned index files yield retrieval performance comparable to, or even better than, that of the full index and direct index for several tasks in the ad hoc track.
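The difference between a full and a direct element-index can be illustrated with a small sketch. It is only an illustration of the two definitions given in the abstract, not the authors' implementation: the function name `build_element_indexes`, the path notation used as element identifiers, and the whitespace tokenizer are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_element_indexes(xml_text):
    """Return (full, direct) indexes mapping term -> set of element paths.

    full:   an element is listed for every term in its whole subtree.
    direct: an element is listed only for terms directly under it.
    Element paths such as /article/sec[2] are a hypothetical id scheme
    (position among all siblings), chosen only for readability.
    """
    root = ET.fromstring(xml_text)
    full = defaultdict(set)
    direct = defaultdict(set)

    def direct_terms(elem):
        # Text directly under elem: its own text plus the tails of its children.
        chunks = [elem.text or ""] + [child.tail or "" for child in elem]
        return " ".join(chunks).lower().split()

    def visit(elem, path):
        own = direct_terms(elem)
        subtree = list(own)
        for i, child in enumerate(elem):
            subtree += visit(child, f"{path}/{child.tag}[{i + 1}]")
        for term in own:
            direct[term].add(path)
        for term in subtree:
            full[term].add(path)
        return subtree

    visit(root, f"/{root.tag}")
    return dict(full), dict(direct)


if __name__ == "__main__":
    sample = "<article><title>index pruning</title><sec>static <p>pruning</p></sec></article>"
    full, direct = build_element_indexes(sample)
    print("full postings:  ", sum(len(v) for v in full.values()))
    print("direct postings:", sum(len(v) for v in direct.values()))
```

In the full index every ancestor of a text node repeats its terms, which is the redundancy the abstract refers to; the direct index stores each term occurrence only at the element that immediately contains it.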

Section: Performance Comparison Of Indexing Strategies: Focused Task (mentioning)

confidence: 96%

Section: Pruning the Element-index For XML Retrieval (mentioning)

confidence: 99%

XML Retrieval Using Pruned Element-Index Files

Altıngövde, Atilgan, Ulusoy (2010)

“…In a nutshell, TCP scores (using the Smart's TFIDF function) and sorts the postings of each term in the collection and removes the tail of the list according to some decision criteria. In [1], instead of the TFIDF function, BM25 is employed during the pruning and retrieval stages. In that study, it's shown that by tuning the pruning algorithm according to the score function, it is possible to further boost the performance.…”

Section: Static Inverted Index Pruning (mentioning)

confidence: 99%

“…Thus, it is hard to infer how these two approaches, namely, TCP and DCP, compare to each other. Furthermore, given the evidence of recent work on how tuning the scoring function boosts the performance [1], it is important to investigate the robustness of these methods for different scoring functions that are employed during the pruning and retrieval, i.e., query execution.…”

Section: Static Inverted Index Pruning (mentioning)

confidence: 99%

A Practitioner’s Guide for Static Index Pruning

Altıngövde, Ozcan, Ulusoy (2009)

Abstract. We compare the term- and document-centric static index pruning approaches as described in the literature and investigate their sensitivity to the scoring functions employed during the pruning and actual retrieval stages.

Static Inverted Index Pruning

Static index pruning permanently removes some information from the index, for the purposes of making better use of disk space and improving query processing efficiency. In the literature, several approaches are investigated for static index pruning.

Among those methods, the term-centric pruning (referred to as TCP hereafter) proposed in [3] is shown to be very successful at keeping the top-k (k≤30) answers almost unchanged for the queries while significantly reducing the index size. In a nutshell, TCP scores (using SMART's TFIDF function) and sorts the postings of each term in the collection and removes the tail of the list according to some decision criteria. In [1], instead of the TFIDF function, BM25 is employed during the pruning and retrieval stages. In that study, it is shown that by tuning the pruning algorithm according to the score function, it is possible to further boost the performance.

On the other hand, the document-centric pruning (referred to as DCP hereafter) introduced in [2] is also shown to give high performance gains. In the DCP approach, only those terms that are most likely to be queried are left in a document, and the others are discarded. The importance of a term for a document is determined by its contribution to the document's Kullback-Leibler divergence (KLD) from the entire collection. However, the experimental setup in this latter work is significantly different from that of [3]. That is, only the most frequent terms of the collection are pruned and the resulting (relatively small) index is kept in memory, whereas the remaining unpruned body of the index resides on disk. During retrieval, if the query term is not found in the pruned index in memory, the unpruned index is consulted. Thus, it is hard to infer how these two approaches, namely TCP and DCP, compare to each other. Furthermore, given the evidence of recent work on how tuning the scoring function boosts the performance [1], it is important to investigate the robustness of these methods for different scoring functions that are employed during pruning and retrieval, i.e., query execution.

In this paper, we provide a performance comparison of the TCP and DCP approaches in terms of retrieval effectiveness for certain pruning levels. Furthermore, for TCP, we investigate how using the Kullback-Leibler divergence scores, instead of TFIDF or BM25, during the pruning affects the performance. This may allow applying the TCP method independent of the retrieval function and thus providing more flexibility for the
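The two pruning styles described in this excerpt can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure of [1], [2], or [3]: the `tfidf` function is a simple stand-in rather than the exact SMART formula, the TCP decision criterion (keep postings scoring at least `epsilon` times the k-th best score in the list) is one commonly described choice that the excerpt itself does not spell out, and the DCP variant keeps a fixed number of terms per document. All function and parameter names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf(tf, df, n_docs):
    # Stand-in scoring function (assumption; not the exact SMART TFIDF).
    return (1.0 + math.log(tf)) * math.log(n_docs / df)


def term_centric_prune(index, n_docs, k=2, epsilon=0.8):
    """TCP-style sketch. index maps term -> {doc_id: tf}.

    For each term, postings are scored and those scoring below
    epsilon * (k-th highest score) are dropped (assumed criterion).
    """
    pruned = {}
    for term, postings in index.items():
        df = len(postings)
        scored = {d: tfidf(tf, df, n_docs) for d, tf in postings.items()}
        ranked = sorted(scored.values(), reverse=True)
        if len(ranked) <= k:
            # Lists with at most k postings are kept whole.
            pruned[term] = dict(postings)
            continue
        threshold = epsilon * ranked[k - 1]
        pruned[term] = {d: tf for d, tf in postings.items()
                        if scored[d] >= threshold}
    return pruned


def kld_contributions(doc_tf, coll_tf, coll_len):
    """Per-term contribution to KLD(document || collection)."""
    doc_len = sum(doc_tf.values())
    contrib = {}
    for term, tf in doc_tf.items():
        p_doc = tf / doc_len
        p_coll = coll_tf[term] / coll_len
        contrib[term] = p_doc * math.log(p_doc / p_coll)
    return contrib


def document_centric_prune(docs, keep=2):
    """DCP-style sketch. docs maps doc_id -> list of tokens.

    Only the `keep` terms with the largest KLD contribution survive
    in each document; the pruned inverted index is returned.
    """
    coll_tf = Counter(t for tokens in docs.values() for t in tokens)
    coll_len = sum(coll_tf.values())
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        doc_tf = Counter(tokens)
        contrib = kld_contributions(doc_tf, coll_tf, coll_len)
        # Ties are broken alphabetically so the result is deterministic.
        for term in sorted(contrib, key=lambda t: (-contrib[t], t))[:keep]:
            index[term][doc_id] = doc_tf[term]
    return dict(index)
```

The contrast is in the unit of decision: the TCP sketch decides posting by posting within each term's list, so its outcome depends on the scoring function, whereas the DCP sketch decides term by term within each document using only term distributions.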

Focused Retrieval and Evaluation

2010

In this paper, we first employ the well-known Cover-Coefficient Based Clustering Methodology (C3M) for clustering XML documents. Next, we apply index pruning techniques from the literature to reduce the size of the document vectors. Our experiments show that, in certain cases, it is possible to prune up to 70% of the collection (or, more specifically, the underlying document vectors) and still generate a clustering structure of the same quality as that of the original collection, in terms of a set of evaluation metrics.