Document allocation policies for selective searching of distributed indexes

Kulkarni, Anagha; Callan, Jamie

doi:10.1145/1871437.1871497

Cited by 46 publications

(73 citation statements)

References 18 publications


“…However, in this approach one step goes forward, using an ontology-based fuzzy similarity, based on both semantic and structural issues. The centralization used in these techniques [28][29][30][31] for dealing with a distributed framework results in a reduction of the search efficiency. Obviously, gathering information within the distributed system causes a computational overload.…”

Section: B Experiments (mentioning)

confidence: 99%

A Distributed Framework for Content Search Using Small World Communities

Javadi-Moghaddam, Kollias

2016

ijacsa


Abstract: The continuous growth of multimedia content available all over the web is raising the importance of a distributed framework for searching it. One of the important parameters in a distributed environment is system response time. This parameter plays an especially important role in search and retrieval. A novel two-tier structure is introduced in this paper, which focuses on the community concept to facilitate creation of ontological small worlds that can effectively assist the search task. As a result, user queries are forwarded to nodes that are likely to contain the relevant resources. Evaluation of the framework proves that the small world character of the proposed structure provides queries with better route selection and searching efficiency.


“…Recent research has focused on reducing the search cost per query without hurting overall effectiveness by reordering the documents in each shard by topic or similarity [3]. These systems are able to achieve effectiveness close to a search over the entire collection (exhaustive search) while using only a few shards for each [Table 1: The proportion of system instances that demonstrated a significant difference using a paired t-test, and the p values when comparing the sample-based IR algorithm proposed by Kulkarni and Callan [3] at varying CSI sample rates with a deterministic exhaustive search, and with itself (a nondeterministic algorithm) with a CSI sample rate of 4% using the TREC GOV2 dataset and TREC topics 701-850.]…”

Section: Case Study (mentioning)

confidence: 99%

“…Thus, we have 10 different instances of the sharded index. As with the original experiments [3], 50 shards were formed per instance, and the full dependency model (FDM) is used to rank the queries [4]. Selecting a subset of 5 shards produced equivalent retrieval results at depth 10 to exhaustive search [3].…”

Section: Experimental Testbed (mentioning)

confidence: 99%

“…with non-deterministic output introduces an additional dimension of variability into this scenario. In this paper, we contribute the following: (1) We present a methodology to solve the two-dimensional significance testing problem (Section 3); (2) We explore the properties of our solution on a case study of common sampling-based algorithms, namely shard construction and centralized resource allocation in distributed IR [3,6]. We examine the variability that can occur in this environment, observing that an apparently significant result on one instance of a sample-based algorithm can be contradicted by another, and we demonstrate the use of our two-dimensional significance testing methods to handle the variability and provide sound statistical inferences (Section 4).…”

Section: Introduction (mentioning)

confidence: 99%


Evaluating non-deterministic retrieval systems

Jayasinghe, Webber, Sanderson, et al.

2014

Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval


The use of sampling, randomized algorithms, or training based on the unpredictable inputs of users in Information Retrieval often leads to non-deterministic outputs. Evaluating the effectiveness of systems incorporating these methods can be challenging since each run may produce different effectiveness scores. Current IR evaluation techniques do not address this problem. Using the context of distributed information retrieval as a case study for our investigation, we propose a solution based on multivariate linear modeling. We show that the approach provides a consistent and reliable method to compare the effectiveness of non-deterministic IR algorithms, and explain how statistics can safely be used to show that two IR algorithms have equivalent effectiveness.


“…The goals of index partitioning algorithms are to distribute documents across nodes based on document similarity, to facilitate the efficient selection of retrieval resources, such that documents relevant to a query are concentrated across a few shards [22]. There are two main index partitioning strategies [9]:…”

Section: Introduction (mentioning)

confidence: 99%

Balanced Search Space Partitioning for Distributed Media Redundant Indexing

Mourão, Magalhães

2017

Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval


This paper addresses the problem of balanced, redundant indexing of media information. Our goal is to partition and distribute the search index, taking advantage of the distributed systems properties: balanced load across nodes, redundancy on node down and efficient node usage under concurrent querying. We follow an information compression approach to solve this problem and propose to represent data with overcomplete codebooks, where each document is represented by only a few codewords and an indexing node is responsible for several codewords. Quantization algorithms are designed to fit the original data as best as possible, leading to bias towards codewords that fit the principal directions of data. In this paper, we propose the balanced KSVD (B-KSVD) algorithm, that distributes the allocation of data across a balanced number of codewords, according to the global distribution of data. Indexing experiments showed that B-KSVD can achieve 38% 1-recall by inspecting only 1% of the full index, distributed over 10 partitions. Traditional methods based on k-means need to either use larger codebooks or to inspect a larger portion of the index to achieve the same retrieval performance.
