Multi-tier architecture for Web search engines

Risvik, Knut Magne; Aasheim, Y.; Lidal, M.

doi:10.1109/laweb.2003.1250291

Cited by 24 publications

(23 citation statements)

References 5 publications


“…It is common for search engines, in both, research and commercial systems, to partition large document collection into multiple tiers and shards (disjoint indexes) [4,3,17].…”

Section: Introduction (mentioning)

confidence: 99%

Document allocation policies for selective searching of distributed indexes

Kulkarni

Callan

2010

Proceedings of the 19th ACM International Conference on Information and Knowledge Management


Indexes for large collections are often divided into shards that are distributed across multiple computers and searched in parallel to provide rapid interactive search. Typically, all index shards are searched for each query. For organizations with modest computational resources the high query processing cost incurred in this exhaustive search setup can be a deterrent to working with large collections. This paper investigates document allocation policies that permit searching only a few shards for each query (selective search) without sacrificing search accuracy. Random, source-based and topic-based document-to-shard allocation policies are studied in the context of selective search. A thorough study of the tradeoff between search cost and search accuracy in a sharded index environment is performed using three large TREC collections. The experimental results demonstrate that selective search using topic-based shards cuts the search cost to less than 1/5th of that of the exhaustive search without reducing search accuracy across all the three datasets. Stability analysis shows that 90% of the queries do as well or improve with selective search. An overlap-based evaluation with an additional 1000 queries for each dataset tests and confirms the conclusions drawn using the smaller TREC query sets.



“…Our work is different from the works in [4,17] as follows. The work in [5] reports only four sample query times, and the work in [4] does not include any performance evaluation, while we analyse the strategy for partitioning a collection by document, broken down overall performance in costs of critical phases of query execution and identified a set of design trade-offs over a distributed architecture.…”

Section: Related Work (mentioning)

confidence: 88%

“…This issue is outside the scope of our work, but it is one that deserves attention, particularly by the community of systems performance and operating systems. The Google and FAST search engine architectures are presented in [4,5,17]. In the first phase of query execution, index servers consult an inverted index and determine a set of relevant documents.…”

Section: Throughput (mentioning)

confidence: 99%

“…A critical component of all search engines is the cluster of servers where the indexes are stored [4,17]. Since user queries are treated as a conjunction of the query terms (to reduce the size of the answer set), the servers in the cluster need to read the answers from disk, execute a conjunctive operation on them, and rank the selected answers.…”

Section: Introduction (mentioning)

confidence: 99%


Basic issues on the processing of web queries

Badué

Barbosa

Golgher

et al.

2005

Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval


Search engines represent a key component of Web economy these days. Despite that, there is not much technical literature available on their design, fine tuning, and internal operation. In this work, we make a preliminary attempt to partially fulfill this gap. We distinguish that Web query processing is composed of two phases: (a) retrieving information on documents related to the queries and ranking them, and (b) generating snippets, title, and URL information for the answer page. The second phase has cost that is basically constant on the size of the collection, while the cost of the first phase is affected by the size of the collection. Thus, we concentrate here on studying the behavior of a search engine while executing the first phase of query processing. Using real data and a small cluster of index servers, we study four basic and key issues related to this first phase of query processing: load balance, broker behavior, performance by individual index servers, and overall throughput. Our study, while preliminary, does reveal interesting tradeoffs: (1) that load unbalance at low query arrival rates can be controlled with a simple measure of randomizing the distribution of documents among the index servers, (2) that the broker is not a bottleneck, (3) that disk and CPU utilization at individual servers depends on the relationship between memory size and the distribution of frequencies for the query terms, and (4) that load unbalance at high loads prevents higher throughput. Our results suggest that further studying and evaluating search engines is a promising research avenue.


“…The first two factors can be easily included while computing scores as outlined above. The most commonly used way to integrate the other factors is to precompute a global importance score for each document, as done in PageRank [9], or a few importance scores for different topic groups [18], and to simply add these scores to the term-based scores during query execution [29,24,30]. Our approach does not depend on the ranking function as long as the total cost is dominated by the inverted list traversal.…”

Section: Term-based Ranking (mentioning)

confidence: 99%

Three-Level Caching for Efficient Query Processing in Large Web Search Engines

Long

Suel

2006

World Wide Web


Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.

