Performance Analysis of Distributed Architectures to Index One Terabyte of Text

Cacheda, Fidel; Plachouras, Vassilis; Ounis, Iadh

doi:10.1007/978-3-540-24752-4_29

Cited by 16 publications

(24 citation statements)

References 12 publications


“…This study is a continuation of our previous work, introduced in [1] and extended in [2], on the choice of optimal architectures for building a distributed large-scale IR system. The SPIRIT collection (94,552,870 documents and 1 terabyte (TB) of text) [3] was used in these previous studies to simulate a distributed IR system using a local inverted file strategy, with the aim of measuring the performance for different configurations (distributed, replicated and clustered systems).…”

Section: Introduction (mentioning)

confidence: 82%

“…The simulated distributed IR system is an extension of the Terrier IR system described in [5]. Moreover, we use the analytical model described in [1] and [2] for the simulation of the querying process in the distributed IR system. The SPIRIT collection [3] is simulated (94,552,870 documents and on average 456 words per document).…”

Section: Simulation Model (mentioning)

confidence: 99%

“…The SPIRIT collection [3] is simulated (94,552,870 documents and on average 456 words per document). In order to test the performance, we generate 50 queries, following the skewed query model [1] [2]. The performance is measured using 5 different simulations and calculating the corresponding average throughput.…”

Section: Simulation Model (mentioning)

confidence: 99%


Performance analysis of distributed information retrieval architectures using an improved network simulation model

Cacheda, Carneiro, Plachouras, et al. 2007

Information Processing & Management

Self Cite


Abstract. In this study, we present the analysis of the interconnection network of a distributed Information Retrieval (IR) system, by simulating a switched network versus a shared access network. The results show that the use of a switched network improves the performance, especially in a replicated system because the switched network prevents the saturation of the network, particularly when using a large number of query servers.


“…For example, in our work on text-based information retrieval we use the SPIRIT collection of 94,552,870 web pages [8] crawled directly from the internet in 2001, as described in section 3. As noted in [3] the size of the vocabulary for a collection of text documents follows Heaps law [6] and with an average document length of 456 terms, the number of index terms should be approximately 73,600,000. Although this is a huge number of terms, most of them correspond to numeric and mis-spellings and have very low frequencies of occurrence, and the actual number of content-bearing terms, or dimensions in the feature space, is much less.…”

Section: Document Retrieval (mentioning)

confidence: 99%
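
The arithmetic behind the quoted estimate can be sketched as follows. Heaps' law models vocabulary size as V = K * n^beta, where n is the total number of tokens. The quoted passage gives the document count, the average document length, and the resulting estimate, but not the K and beta it used, so the exponents below are illustrative assumptions, and the function names are hypothetical. The sketch computes the token count and shows which K each assumed beta would imply for the quoted estimate.

```python
# Sketch only: Heaps' law V = K * n**beta.
# The beta values used at the bottom are assumptions for illustration;
# the quoted passage does not state the parameters it used.

DOCS = 94_552_870           # documents in the SPIRIT collection (from the text)
AVG_TERMS_PER_DOC = 456     # average document length in terms (from the text)
QUOTED_VOCAB = 73_600_000   # approximate number of index terms quoted in the text


def total_tokens(docs: int, avg_terms: int) -> int:
    """Total number of term occurrences in the collection."""
    return docs * avg_terms


def heaps_vocab(n_tokens: int, k: float, beta: float) -> float:
    """Vocabulary size predicted by Heaps' law for n_tokens tokens."""
    return k * n_tokens ** beta


def implied_k(vocab: float, n_tokens: int, beta: float) -> float:
    """The K that reproduces a given vocabulary size for an assumed beta."""
    return vocab / n_tokens ** beta


if __name__ == "__main__":
    n = total_tokens(DOCS, AVG_TERMS_PER_DOC)
    print(f"total tokens: {n:,}")
    for beta in (0.4, 0.5, 0.6):  # hypothetical exponents
        k = implied_k(QUOTED_VOCAB, n, beta)
        print(f"beta={beta}: implied K={k:.2f}")
```

Because K and beta trade off against each other, the single quoted estimate does not determine them; a second data point (vocabulary size at another collection size) would be needed to fit both.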

Text based approaches for content-based image retrieval on large image collections

Wilkins, Ferguson, Smeaton, et al. 2005

2nd European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology (EWIMT 2005)


As the growth of digital image collections continues, so does the need for efficient content-based searching of images capable of providing quality results within a search time that is acceptable to users who have grown used to text search engine performance. Some existing techniques, whilst being capable of providing relevant results to a user's query, will not scale up to very large image collections, the order of which will be in the millions. In this paper we propose a technique that uses text-based IR methods for indexing MPEG-7 visual features (from the MPEG-7 XM) to perform rapid subset selection within large image collections. Our test collection consists of 750,000 images crawled from the SPIRIT collection (discussed in section 3) and a separate set of 1000 query images also from the SPIRIT collection. An initial experiment is presented to measure the accuracy of the subset generated for each query image by taking the top 100 results of the subset, and comparing those to the top 100 results derived from a complete ranking of the collection for that query image. Ranking is performed via L2 Minkowski distance measures for both sets.


“…There are a number of papers evaluating DP parallel IR systems; see for instance [1], [4], [5], [32], [34]. All of the above mentioned studies adopt a common architecture for parallel IRSs. It follows the master/worker model where workers are the actual search modules which receive queries from and return results to the master that is also known as the query broker (QB).…”

Section: Introduction (mentioning)

confidence: 99%
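
The master/worker arrangement described in this statement can be sketched as a query broker that forwards each query to every worker and merges the partial rankings. This is a minimal illustration rather than the architecture of any cited system: the class names, the additive scoring, and the toy indexes are all assumptions, and the workers are called sequentially here where a real system would query them in parallel.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical sketch of a master/worker IR system with document partitioning:
# each worker holds a local index over its own documents; the broker merges results.


class Worker:
    """Search module holding a local index: term -> {doc_id: score}."""

    def __init__(self, index: Dict[str, Dict[str, float]]) -> None:
        self.index = index

    def search(self, terms: List[str], k: int) -> List[Tuple[float, str]]:
        # Sum per-term scores for each local document (assumed scoring scheme).
        scores: Dict[str, float] = {}
        for term in terms:
            for doc_id, score in self.index.get(term, {}).items():
                scores[doc_id] = scores.get(doc_id, 0.0) + score
        # Return only the local top k as (score, doc_id) pairs.
        return heapq.nlargest(k, ((s, d) for d, s in scores.items()))


class QueryBroker:
    """Master: sends the query to all workers and merges their top-k lists."""

    def __init__(self, workers: List[Worker]) -> None:
        self.workers = workers

    def search(self, terms: List[str], k: int) -> List[Tuple[str, float]]:
        partial: List[Tuple[float, str]] = []
        for worker in self.workers:  # sequential here; parallel in practice
            partial.extend(worker.search(terms, k))
        return [(doc_id, score) for score, doc_id in heapq.nlargest(k, partial)]
```

Since each document lives on exactly one worker, taking the top k from each worker and then the top k of the union yields the same ranking as scoring the whole collection in one place under this scoring scheme.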

Query-driven document partitioning and collection selection

Puppin, Silvestri, Laforenza 2006

Proceedings of the 1st International Conference on Scalable Information Systems - InfoScale '06


Abstract. We present a novel strategy to partition a document collection onto several servers and to perform effective collection selection. The method is based on the analysis of query logs. We proposed a novel document representation called query-vectors model. Each document is represented as a list recording the queries for which the document itself is a match, along with their ranks. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results, and are used for collection selection. We show that this document partition strategy greatly boosts the performance of standard collection selection algorithms, including CORI, w.r.t. a round-robin assignment. Secondly, we show that performing collection selection by matching the query to the existing query clusters and successively choosing only one server, we reach an average precision-at-5 up to 1.74 and we constantly improve CORI precision of a factor between 11% and 15%. As a side result we show a way to select rarely asked-for documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52% of the documents (3,128,366) are not returned among the first 100 top-ranked results of any query.
