A joint probabilistic classification model for resource selection

Hong, Dzung; Si, Luo; Bracke, Paul; Witt, Michael; Juchcinski, Tim

doi:10.1145/1835449.1835468

Cited by 25 publications

(22 citation statements)

References 25 publications

“…A resource is considered relevant if has more than a threshold (τ) number of documents among the top T documents from the full result. Hong et al. [10] extend this work for cases where a full dataset search is infeasible. Instead of the full dataset result, they build the 'full result' using just the top-T documents from each resource.…”

Section: Related Work (mentioning)

confidence: 87%
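
The labelling rule in the statement above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the cited authors' implementation: the function names, the `(doc_id, score)` input format, and the premise that scores are comparable across resources are all hypothetical.

```python
from collections import Counter

def merged_ranking(per_resource_results, top_t):
    """Build an approximate 'full result' from each resource's top-T documents.

    per_resource_results maps resource_id -> list of (doc_id, score),
    already sorted by descending score (assumed input format).
    Scores are assumed to be comparable across resources.
    """
    pool = [
        (score, doc_id, resource_id)
        for resource_id, docs in per_resource_results.items()
        for doc_id, score in docs[:top_t]
    ]
    pool.sort(key=lambda item: -item[0])  # highest score first
    return [(doc_id, resource_id) for _, doc_id, resource_id in pool]

def relevant_resources(ranking, top_t, tau):
    """Return resources with more than tau documents in the top T of the ranking."""
    counts = Counter(resource_id for _, resource_id in ranking[:top_t])
    return {resource_id for resource_id, n in counts.items() if n > tau}

# Toy data, invented for illustration only.
results = {
    "A": [("a1", 0.9), ("a2", 0.8), ("a3", 0.1)],
    "B": [("b1", 0.85), ("b2", 0.2)],
    "C": [("c1", 0.05)],
}
ranking = merged_ranking(results, top_t=3)
labels = relevant_resources(ranking, top_t=3, tau=1)
```

With this toy data, only resource `A` contributes more than τ = 1 documents to the top 3 of the merged list, so it is the only one labelled relevant.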

LTRo: Learning to Route Queries in Clustered P2P IR

Alkhawaldeh, Deepak, Jose, et al. 2017

Lecture Notes in Computer Science

Abstract. Query Routing is a critical step in P2P Information Retrieval. In this paper, we consider learning to rank approaches for query routing in the clustered P2P IR architecture. Our formulation, LTRo, scores resources based on the number of relevant documents for each training query, and uses that information to build a model that would then rank promising peers for a new query. Our empirical analysis over a variety of P2P IR testbeds illustrates the superiority of our method against the state-of-the-art methods for query routing.

“…While current shard-selection techniques do not combine multiple types of evidence to make predictions, prior work on text-based federated search used machine learning to combine a wide range of features for the task of resource selection [Arguello et al., 2009a; Hong et al., 2010]. In particular, because shards are topically focused, the query category features discussed later in Section 2.3 might contribute valuable evidence for shard selection.…”

Section: Selective Search (mentioning)

confidence: 99%

Aggregated Search

Arguello 2017

FNT in Information Retrieval

“…In our IE scenario, to estimate the number of useful documents we should define f(d) = 1{d is useful}, that is, as the indicator function that returns 1 if d is useful and 0 otherwise. Various methods have been proposed to estimate properties of (queryable) document collections (e.g., collection size, number of documents relevant to a query, average document length) [4,19,34,35], and these methods can be classified in three broad classes: (i) surrogate-based methods, (ii) query pool-based methods, and (iii) query pool-free methods.…”

Section: Overview Of Estimation Approaches (mentioning)

confidence: 99%
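
One simple way to use an indicator function like the f(d) quoted above is to average it over a document sample and scale by the collection size. The sketch below assumes a uniform random sample and a known collection size, neither of which the statement guarantees; `estimate_useful_count` and the toy `is_useful` predicate are hypothetical names, and this is not presented as any of the three method classes the statement lists.

```python
def estimate_useful_count(sample, collection_size, is_useful):
    """Estimate how many documents in a collection are useful.

    sample: documents assumed to be drawn uniformly at random.
    collection_size: total number of documents (assumed known).
    is_useful: plays the role of f(d), returning True or False.
    """
    if not sample:
        return 0.0
    hits = sum(1 for d in sample if is_useful(d))  # sum of f(d) over the sample
    return collection_size * hits / len(sample)

# Toy data, invented for illustration only.
docs = ["contains a tuple", "nothing here", "tuple again", "also nothing"]
estimate = estimate_useful_count(docs, 1000, lambda d: "tuple" in d)
```

Here half of the sampled documents are useful, so the estimate for a collection of 1,000 documents is 500.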

“…Now, to normalize DCG@k (and obtain nDCG@k), we need to calculate the DCG@k of an ideal ranking, namely, IDCG@k. Finally, nDCG@k = DCG@k / IDCG@k.…”

Section: Experimental Settings (mentioning)

confidence: 99%
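
The normalization described in the statement above can be sketched as follows. The statement does not give the DCG formula itself, so this sketch assumes the common form with a log2 rank discount, and it builds the ideal ranking by sorting the same gain list; a real evaluation would usually derive the ideal ranking from all candidate items. The function names are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """DCG@k with a log2 rank discount (assumed variant)."""
    # Rank i (1-based) is discounted by log2(i + 1); enumerate is 0-based.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """nDCG@k = DCG@k / IDCG@k, where IDCG@k is the DCG@k of an ideal ranking."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    if ideal == 0:
        return 0.0  # no gain anywhere, so define the score as 0
    return dcg_at_k(gains, k) / ideal

perfect = ndcg_at_k([3, 2, 1], k=3)   # already in ideal order
swapped = ndcg_at_k([1, 3], k=2)      # best item ranked second
```

A ranking that is already in ideal order scores 1, and any ranking that places a higher-gain item below a lower-gain one scores strictly less.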

Ranking Deep Web Text Collections for Scalable Information Extraction

Barrio, Gravano, Develder 2015

Proceedings of the 24th ACM International on Conference on Information and Knowledge Management

Information extraction (IE) systems discover structured information from natural language text, to enable much richer querying and data mining than possible directly over the unstructured text. Unfortunately, IE is generally a computationally expensive process, and hence improving its efficiency, so that it scales over large volumes of text, is of critical importance. State-of-the-art approaches for scaling the IE process focus on one text collection at a time. These approaches prioritize the extraction effort by learning keyword queries to identify the "useful" documents for the IE task at hand, namely, those that lead to the extraction of structured "tuples." These approaches, however, do not attempt to predict which text collections are useful for the IE task (and hence merit further processing) and which ones will not contribute any useful output (and hence should be ignored altogether, for efficiency). In this paper, we focus on an especially valuable family of text sources, the so-called deep web collections, whose (remote) contents are only accessible via querying. Specifically, we introduce and study techniques for ranking deep web collections for an IE task, to prioritize the extraction effort by focusing on collections with substantial numbers of useful documents for the task. We study both (adaptations of) state-of-the-art resource selection strategies for distributed information retrieval, and IE-specific approaches. Our extensive experimental evaluation over realistic deep web collections, and for several different IE tasks, shows the merits and limitations of the alternative families of approaches, and provides a roadmap for addressing this critically important building block for efficient, scalable information extraction.
