2004
DOI: 10.1007/978-3-540-24752-4_29

Performance Analysis of Distributed Architectures to Index One Terabyte of Text

Abstract: We simulate different architectures of a distributed Information Retrieval system on a very large Web collection, in order to work out the optimal setting for a particular set of resources. We analyse the effectiveness of distributed, replicated and clustered architectures using a variable number of workstations. A collection of approximately 94 million documents and 1 terabyte of text is used to test the performance of the different architectures. We show that in a purely distributed architecture, t…

Cited by 16 publications (24 citation statements)
References 12 publications
“…This study is a continuation of our previous work, introduced in [1] and extended in [2], on the choice of optimal architectures for building a distributed large-scale IR system. The SPIRIT collection (94,552,870 documents and 1 terabyte (TB) of text) [3] was used in these previous studies to simulate a distributed IR system using a local inverted file strategy, with the aim of measuring the performance for different configurations (distributed, replicated and clustered systems).…”
Section: Introduction (mentioning)
confidence: 82%
“…The simulated distributed IR system is an extension of the Terrier IR system described in [5]. Moreover, we use the analytical model described in [1] and [2] for the simulation of the querying process in the distributed IR system. The SPIRIT collection [3] is simulated (94,552,870 documents and on average 456 words per document).…”
Section: Simulation Model (mentioning)
confidence: 99%
“…For example, in our work on text-based information retrieval we use the SPIRIT collection of 94,552,870 web pages [8] crawled directly from the internet in 2001, as described in section 3. As noted in [3], the size of the vocabulary for a collection of text documents follows Heaps' law [6], and with an average document length of 456 terms, the number of index terms should be approximately 73,600,000. Although this is a huge number of terms, most of them correspond to numeric strings and misspellings and have very low frequencies of occurrence, and the actual number of content-bearing terms, or dimensions in the feature space, is much less.…”
Section: Document Retrieval (mentioning)
confidence: 99%
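
The Heaps' law estimate in the citation statement above can be sketched in a few lines. Heaps' law is commonly written as V = K * n^beta, where n is the total number of tokens and V the vocabulary size. The document count and average document length below come from the statement; the constants K_ASSUMED and BETA_ASSUMED are illustrative placeholders only, since the values the cited authors used are not given here, so the printed estimate is not expected to reproduce their figure of 73,600,000.

```python
def total_tokens(num_docs: int, avg_doc_len: int) -> int:
    """Total token count of a collection, from its size and mean document length."""
    return num_docs * avg_doc_len


def heaps_vocabulary(n_tokens: int, k: float, beta: float) -> float:
    """Heaps' law: estimated number of distinct terms, V = K * n**beta."""
    return k * n_tokens ** beta


# Figures quoted in the citation statement for the SPIRIT collection.
NUM_DOCS = 94_552_870
AVG_DOC_LEN = 456

# ASSUMPTION: placeholder constants, not taken from the cited papers.
K_ASSUMED = 50.0
BETA_ASSUMED = 0.5

n_tokens = total_tokens(NUM_DOCS, AVG_DOC_LEN)
estimate = heaps_vocabulary(n_tokens, K_ASSUMED, BETA_ASSUMED)

print(f"tokens in collection: {n_tokens:,}")
print(f"estimated vocabulary (assumed K, beta): {estimate:,.0f}")
```

Because beta is below 1, the estimated vocabulary grows more slowly than the collection itself, which is why the term count stays far smaller than the token count even at this scale.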
“…There are a number of papers evaluating DP parallel IR systems; see for instance [1], [4], [5], [32], [34]. All of the above mentioned studies adopt a common architecture for parallel IRSs. It follows the master/worker model where workers are the actual search modules which receive queries from and return results to the master that is also known as the query broker (QB).…”
Section: Introduction (mentioning)
confidence: 99%
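
The master/worker model described in the statement above can be illustrated with a minimal sketch. It is not the implementation of any cited system: the names make_worker and query_broker, the toy term weights, and the additive scoring are all hypothetical. Each worker searches only its own partition with a local inverted file, and the query broker merges the partial result lists. Workers are called one after another here for simplicity, whereas a real deployment would query them in parallel over a network.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Hit = Tuple[float, str]                       # (score, document id)
Worker = Callable[[str, int], List[Hit]]      # search(query, k) -> top-k hits


def make_worker(local_index: Dict[str, Dict[str, float]]) -> Worker:
    """Build a search module over one partition's inverted file.

    local_index maps term -> {doc_id: weight}; the weights are assumed
    to be precomputed by some ranking function.
    """
    def search(query: str, k: int) -> List[Hit]:
        scores: Dict[str, float] = {}
        for term in query.lower().split():
            for doc_id, weight in local_index.get(term, {}).items():
                scores[doc_id] = scores.get(doc_id, 0.0) + weight
        # Return only this partition's k best documents.
        return heapq.nlargest(k, ((s, d) for d, s in scores.items()))
    return search


def query_broker(query: str, workers: List[Worker], k: int) -> List[Hit]:
    """Send the query to every worker and merge the partial top-k lists."""
    partial: List[Hit] = []
    for worker in workers:
        partial.extend(worker(query, k))
    return heapq.nlargest(k, partial)


# Two toy partitions of a document collection.
worker_a = make_worker({"text": {"d1": 2.0, "d2": 0.5}, "index": {"d1": 1.0}})
worker_b = make_worker({"text": {"d3": 1.5}, "index": {"d4": 2.5}})

print(query_broker("text index", [worker_a, worker_b], k=2))
```

Asking each worker for its own top k is enough for a correct global top k, since no document outside a partition's best k can appear in the merged best k. The sketch assumes scores are comparable across partitions, which in practice depends on how collection statistics are shared between workers.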