Using interdocument similarity information in document retrieval systems

Griffiths, Arlo; Luckhurst, H. Claire; Willett, Peter

doi:10.1002/asi.4630370102

Cited by 84 publications

(25 citation statements)

References 21 publications


“…The minimum value of EX and the highest performance level for a random structure is EX = .79; a performance level that compares favorably to current cluster-based retrieval results from a variety of test collections using a variety of clustering criteria (Griffiths et al, 1986).…”

Section: Effectiveness Of Random Structures: Ex(dwts) (mentioning)

confidence: 87%

“…When E = 0, P = R = 1.0, and when E = 1, P = R = 0.0. The E measure has been selected because it provides the opportunity to compare results of the current investigation with existing cluster-based retrieval results (Croft, 1980; Van Rijsbergen, 1974a; Van Rijsbergen & Croft, 1975; Willett, 1984), especially the comprehensive results provided by Griffiths, Luckhurst, & Willett (1986).…”

Section: Effectiveness Measures (mentioning)

confidence: 99%


Subject and citation indexing. Part II: The optimal, cluster-based retrieval performance of composite representations

Shaw. 1991. J. Am. Soc. Inf. Sci.


Measures of cluster-based retrieval effectiveness are computed for five composite representations in the cystic fibrosis (CF) Document Collection. The composite representations are constructed from combinations of two subject representations, based on Medical Subject Headings and subheadings, and two citation representations, consisting of the complete list of cited references and a comprehensive list of citations for each document. Experimental retrieval results are presented as a function of the exhaustivity and similarity of the composite representations and reveal consistent patterns from which optimal performance levels can be identified. The optimal performance values provide an assessment of the absolute capacity of each composite representation to associate documents relevant to the same query and discriminate between documents relevant to different queries in single-link hierarchies. The optimal performance values for all composite representations are completely comparable and are superior to the optimal performance of constituent representations. Optimal performance consistently occurs at low levels of exhaustivity. Exhaustive composite representations that include subject descriptions produce the lowest levels of performance; retrieval results derived from random structures are comparable to the observed results. The effectiveness of the exhaustive representation composed of references and citations is materially superior to the effectiveness of exhaustive composite representations that include subject descriptions.
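The E measure quoted in the citation statement above is not written out there. As a minimal sketch, assuming van Rijsbergen's standard formulation E = 1 − (1 + β²)PR / (β²P + R), the boundary cases in the quote can be checked directly. The formula, the function name `e_measure`, and the `beta` parameter are assumptions for illustration and are not taken from the citing paper.

```python
def e_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Assumed form of van Rijsbergen's E measure (lower is better).

    E = 1 - (1 + beta^2) * P * R / (beta^2 * P + R)
    beta weights recall relative to precision; beta = 1 weights them equally.
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        # No precision and no recall contribution: treat as the worst case.
        return 1.0
    return 1.0 - (1.0 + b2) * precision * recall / denominator


# Boundary cases stated in the quoted passage.
print(e_measure(1.0, 1.0))  # 0.0
print(e_measure(0.0, 0.0))  # 1.0
```

Under this formulation a lower E indicates better retrieval, which matches the quoted statement that E = 0 corresponds to P = R = 1.0.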


“…Indeed, there is a history of successful applications of the general nearest-neighbor approach (e.g., [9]). Within each iteration, Cluster-Audition scoring consists of two phases.…”

Section: Basic Methods For Scoring Renderers (mentioning)

confidence: 99%

Better than the real thing?

Kurland, Lee, Domshlak. 2005. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval


We present a novel approach to pseudo-feedback-based ad hoc retrieval that uses language models induced from both documents and clusters. First, we treat the pseudo-feedback documents produced in response to the original query as a set of pseudo-queries that themselves can serve as input to the retrieval process. Observing that the documents returned in response to the pseudo-queries can then act as pseudo-queries for subsequent rounds, we arrive at a formulation of pseudo-query-based retrieval as an iterative process. Experiments show that several concrete instantiations of this idea, when applied in conjunction with techniques designed to heighten precision, yield performance results rivaling those of a number of previously-proposed algorithms, including the standard language-modeling approach. The use of cluster-based language models is a key contributing factor to our algorithms' success.


“…Since this set is query-dependent, at least some of the clustering process must occur at retrieval time, mandating the use of extremely efficient algorithms [6,37]. The approach we adopt is to use overlapping nearest-neighbor clusters, which have formed the basis of effective retrieval algorithms in other work [12,17,19,33]: for each document d ∈ D_init, we have the cluster {d} ∪ Nbhd(d | k − 1, D_init − {d}), where k is the cluster-size parameter.…”

Section: Graph Construction (mentioning)

confidence: 99%

Respect my authority!

Kurland, Lee. 2006. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval


We present an approach to improving the precision of an initial document ranking wherein we utilize cluster information within a graph-based framework. The main idea is to perform re-ranking based on centrality within bipartite graphs of documents (on one side) and clusters (on the other side), on the premise that these are mutually reinforcing entities. Links between entities are created via consideration of language models induced from them. We find that our cluster-document graphs give rise to much better retrieval performance than previously proposed document-only graphs do. For example, authority-based re-ranking of documents via a HITS-style cluster-based approach outperforms a previously-proposed PageRank-inspired algorithm applied to solely-document graphs. Moreover, we also show that computing authority scores for clusters constitutes an effective method for identifying clusters containing a large percentage of relevant documents.
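The overlapping nearest-neighbor clusters described in the citation statement for this paper (each document together with its k − 1 nearest neighbors from the initial list) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract says links are derived from language models, whereas this sketch swaps in plain cosine similarity over term counts, and the names `cosine` and `nn_clusters` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (assumed similarity)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nn_clusters(docs: dict[str, str], k: int) -> dict[str, list[str]]:
    """For each document d, build the cluster {d} plus its k - 1 nearest neighbors.

    Clusters may overlap, because a document can be a neighbor of many others.
    """
    vectors = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    clusters = {}
    for doc_id, vec in vectors.items():
        others = [o for o in vectors if o != doc_id]
        # Most similar first; ties broken by document id for determinism.
        others.sort(key=lambda o: (-cosine(vec, vectors[o]), o))
        clusters[doc_id] = [doc_id] + others[: k - 1]
    return clusters


# Hypothetical toy "initial list" of three documents.
initial_list = {
    "a": "cluster retrieval",
    "b": "cluster retrieval model",
    "c": "citation indexing",
}
clusters = nn_clusters(initial_list, k=2)
```

Because every document seeds its own cluster, the number of clusters equals the size of the initial list, and the cluster-size parameter k is the only knob in this sketch.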
