2000
DOI: 10.1145/333135.333136

Evaluating the performance of distributed architectures for information retrieval using a variety of workloads

Abstract: The information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this article, we explore how to achieve scalable performance in a distributed system for collection sizes ranging from 1GB to 128GB. We implement a fully functional distributed IR system based on a multithreaded version of the InQuery…


Citations: Cited by 49 publications (37 citation statements)
References: 30 publications
“…We demonstrate that partial replicas can significantly outperform caches using a validated simulator [7,18] which closely matches our working prototype system with replica selection. The prototype uses InQuery for the basic IR functionality [8].…”
Section: Introduction (mentioning)
confidence: 69%
“…Most of the previous work experiments with a text database less than 1 GB and focuses on speedup when a text database is distributed over more servers [5,12,17,20]. Only Couvreur et al [9], and Cahoon et al [6,7] use simulation to experiment with more than 100 GB of data. None of these previous studies include partial replication or caching.…”
Section: Scalable IR Architectures (mentioning)
confidence: 99%
“…Parallel generation of a global index has been studied in [18], while a system which crawls the Web and builds a distributed local index was presented in [16]. Cahoon et al [4] evaluated the computational performance of local indices under a variety of workloads, and Hawking [6] examined scalability issues of local index organizations. The prototype of Google was reported as using global index partitioning [3].…”
Section: Index Structure and Query Processing Models (mentioning)
confidence: 99%
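The distinction drawn in the statement above between local and global index organizations is easy to see in code. The following is a minimal, illustrative sketch and is not taken from any of the cited systems: a local (document-partitioned) organization assigns whole documents to servers and each server indexes only its own slice, while a global (term-partitioned) organization assigns terms to servers so that each server holds the complete posting list for its share of the vocabulary. The function names and toy corpus are assumptions made for illustration.

from collections import defaultdict

def build_local_indices(docs, num_servers):
    """Document-partitioned ("local") organization: server i indexes every term
    of the documents assigned to it."""
    indices = [defaultdict(list) for _ in range(num_servers)]
    for doc_id, text in docs.items():
        server = doc_id % num_servers              # whole documents go to one server
        for term in set(text.lower().split()):
            indices[server][term].append(doc_id)
    return indices

def build_global_index(docs, num_servers):
    """Term-partitioned ("global") organization: exactly one server holds the
    full posting list for each term."""
    indices = [defaultdict(list) for _ in range(num_servers)]
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            indices[hash(term) % num_servers][term].append(doc_id)
    return indices

docs = {0: "distributed information retrieval",
        1: "evaluating retrieval performance"}
print(build_local_indices(docs, 2))   # each server sees only its own documents
print(build_global_index(docs, 2))    # each term's postings live on one server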
“…The prototype of Google was reported as using global index partitioning [3]. Many of the above mentioned works [4,18,6,17,20] describe essentially the same model for processing queries in systems with segmented indices:…”
Section: Index Structure and Query Processing Models (mentioning)
confidence: 99%
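The shared query processing model this statement refers to is usually described, for document-partitioned (segmented) indices, as a scatter-gather scheme: a broker broadcasts the query to every index segment, each segment evaluates it locally and returns its top-k documents, and the broker merges the partial result lists. The sketch below is a hypothetical illustration of that scheme, not code from any of the cited papers; Segment, broker_query, and the precomputed per-document scores are all made up.

import heapq

class Segment:
    """One index segment holding a disjoint subset of the documents."""
    def __init__(self, scores):
        self.scores = scores        # {doc_id: relevance score} under the current query

    def top_k(self, k):
        # A real segment would evaluate the query against its local inverted index;
        # the per-document scores are precomputed here for brevity.
        return heapq.nlargest(k, self.scores.items(), key=lambda kv: kv[1])

def broker_query(segments, k=10):
    """Scatter the request to all segments, gather their partial top-k lists, merge."""
    partial = [hit for seg in segments for hit in seg.top_k(k)]
    return heapq.nlargest(k, partial, key=lambda kv: kv[1])

segments = [Segment({1: 0.9, 2: 0.4}), Segment({3: 0.7, 4: 0.2})]
print(broker_query(segments, k=3))    # -> [(1, 0.9), (3, 0.7), (2, 0.4)]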
“…We downloaded between 1 and 2,303 pages per site by crawling the first two levels of the site. Then, we generated a set of keyword queries from the downloaded pages using term frequencies commonly observed in information retrieval systems (from [4]). We used standard information retrieval techniques to determine which queries matched which documents; namely, TF/IDF weights and the cosine distance [2].…”
Section: Content Sets and Simulation Setup (mentioning)
confidence: 99%
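The matching step described in the last statement, TF/IDF weights combined with the cosine measure, can be summarized in a few lines. The sketch below is a generic illustration of that standard technique rather than the authors' implementation; the toy corpus, the query, and the function names are assumptions.

import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """TF * IDF weight for each term; terms unseen in the corpus are dropped."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df[t]}

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["distributed", "information", "retrieval"],
        ["retrieval", "performance", "evaluation"],
        ["web", "crawling", "and", "indexing"]]
df = Counter(t for d in docs for t in set(d))               # document frequency per term
doc_vecs = [tfidf_vector(d, df, len(docs)) for d in docs]

query = ["information", "retrieval"]
q_vec = tfidf_vector(query, df, len(docs))
ranking = sorted(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)
print(ranking)                                              # doc 0 matches best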