Controlled and uncontrolled subject descriptions in the CF database: A comparison of optimal cluster-based retrieval results

Shaw, Jr. W. M.

doi:10.1016/0306-4573(93)90104-l

Cited by 14 publications

(4 citation statements)

References 18 publications

“…These were then used to produce single-term, uncontrolled representations for the collections. Following Shaw (1993) weights for the document vector collections were based on the customary inverse document frequency formula, $\log(d/d_k)$, where $d$ is the total number of documents in the collection, and $d_k$ is the number of documents in which term $k$ appears. Term weights were then normalized from zero to 999 so that $w_k = 0$ if term $k$ is assigned to all documents and $w_k = 999$ if term $k$ is assigned to one document.…”

Section: Test Collections (mentioning)

confidence: 99%
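
The weighting described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' code: the function name, the linear rescaling, and the rounding to integers are assumptions. The statement fixes only the two endpoints (0 for a term assigned to every document, 999 for a term assigned to exactly one).

```python
import math


def normalized_idf_weights(doc_freq, num_docs, scale=999):
    """Map each term's document frequency d_k to a weight in [0, scale].

    Assumption: weights are rescaled linearly, dividing log(d / d_k) by its
    largest possible value log(d), then rounded to the nearest integer.
    """
    if num_docs < 2:
        raise ValueError("need at least two documents")
    max_idf = math.log(num_docs)  # IDF of a term that occurs in one document
    weights = {}
    for term, d_k in doc_freq.items():
        if not 1 <= d_k <= num_docs:
            raise ValueError(f"document frequency out of range for {term!r}")
        idf = math.log(num_docs / d_k)
        weights[term] = round(scale * idf / max_idf)
    return weights


# Hypothetical document frequencies in a 100-document collection.
example = normalized_idf_weights({"common": 100, "rare": 1, "medium": 10}, 100)
```

Because the weight is a ratio of two logarithms, the result does not depend on the base of the logarithm.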

“…In a series of articles, Shaw has explored the effectiveness of cluster-based retrieval as a function of indexing exhaustivity. These investigations have included examinations of four subject representations based on MeSH subject headings and subheadings employed in the Medline database (Shaw, 1990), four composite representations that include subject and citation representations (Shaw, 1991), and both controlled and uncontrolled subject representations (Shaw, 1993). The results suggest that the performance of a retrieval system based on single-link clustering varies as a function of indexing exhaustivity.…”

Section: Introduction (mentioning)

confidence: 99%

The retrieval effectiveness of five clustering algorithms as a function of indexing exhaustivity

Burgin

1995

J. Am. Soc. Inf. Sci.

The retrieval effectiveness of five hierarchical clustering methods (single link, complete link, group average, Ward's method, and weighted average) is examined as a function of indexing exhaustivity with four test collections (CF, Cranfield, Medlars, and Time). Evaluations of retrieval effectiveness, based on three measures of optimal retrieval performance, confirm earlier findings that the performance of a retrieval system based on single‐link clustering varies as a function of indexing exhaustivity but fail to find similar patterns for other clustering methods. The data also confirm earlier findings regarding the poor performance of single‐link clustering in a retrieval environment. The poor performance of single‐link clustering appears to derive from that method's tendency to produce a small number of large, ill‐defined document clusters. By contrast, the data examined here found the retrieval performance of the other clustering methods to be generally comparable. The data presented here also provide an opportunity to examine the theoretical limits of cluster‐based retrieval and to compare these theoretical limits to the effectiveness of operational implementations. Performance standards for the four document collections examined here were found to vary widely, and the effectiveness of operational implementations was found to be in the range defined as “unacceptable.” Further improvements in search strategies and document representations warrant investigation. © 1995 John Wiley & Sons, Inc.
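
The tendency of single-link clustering to form large, loosely defined clusters can be illustrated with a toy sketch. This is not the procedure used in the study: one-dimensional points and absolute distance stand in for document vectors and a similarity measure, the function name and `max_gap` parameter are hypothetical, and the code cuts the single-link hierarchy at one fixed level instead of building the full hierarchy. At a fixed level, single-link clusters are the connected components of the graph that joins any two items within `max_gap` of each other.

```python
def single_link_clusters(points, max_gap):
    """Group points so that any two within max_gap end up in the same cluster.

    Uses a union-find structure; the resulting groups are the single-link
    clusters obtained by cutting the hierarchy at distance max_gap.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster representative, halving paths.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= max_gap:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    clusters = {}
    for i, point in enumerate(points):
        clusters.setdefault(find(i), []).append(point)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Points 0..4 are chained together one link at a time.
chained = single_link_clusters([0, 1, 2, 3, 4, 10, 11], max_gap=1.5)
```

In this toy input, the points 0 and 4 share a cluster even though they are 4 units apart, because each neighbouring pair is within the gap. This chaining is one way a single-link method can end up with a few large clusters.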

“…The retrieval process itself, i.e., the process of ranking and retrieving documents in response to a set of queries, may also be simulated, as in several articles by Shaw (1990a, 1990b, 1991, 1993). As with other simulation studies, these simulations of an optimal cluster-based retrieval system allow the variability of the retrieval mechanism to be controlled and thereby allow differences in other aspects of the retrieval process to be more carefully examined.…”

Section: Literature Review (mentioning)

confidence: 99%

The Monte Carlo method and the evaluation of retrieval system performance

Burgin

1999

J. Am. Soc. Inf. Sci.

The ability to distinguish between acceptable and unacceptable levels of retrieval performance and the ability to distinguish between significant and non‐significant differences between retrieval results are important to traditional information retrieval experiments. The Monte Carlo method is shown to represent an attractive alternative to the hypergeometric model for identifying the levels at which random retrieval performance is exceeded in retrieval test collections and for overcoming some of the limitations of the hypergeometric model. The Monte Carlo method produces low performance thresholds for the individual test collections that are very similar to the thresholds derived by the hypergeometric model, both at the test collection level and at the individual query level. In addition, the Monte Carlo method is much less computer‐intensive than the hypergeometric model, can be used with measures of retrieval effectiveness that take the rank order of the retrieved documents into consideration, can be used to derive the probability of obtained results, and can be used to determine the statistical significance of difference between two or more retrieval results. The ability to use the Monte Carlo method to derive the probability of obtained results and to compare two or more retrieval results makes it possible to determine more accurately how well retrieval systems operate under specific conditions and, in conjunction with the presentation of individual query results, makes it possible to determine whether relationships between query characteristics and retrieval system performance exist. Understanding these relationships should lead to improvements in the effectiveness of retrieval systems.
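
A Monte Carlo estimate of the level that random retrieval rarely exceeds might look like the sketch below. It is only an illustration of the general idea in the abstract above: the function name, the use of precision as the effectiveness measure, the 95th-percentile cutoff, and the number of trials are all assumptions, not details taken from the paper.

```python
import random


def random_precision_threshold(num_docs, num_relevant, num_retrieved,
                               trials=10000, quantile=0.95, seed=0):
    """Estimate the precision that random retrieval reaches or stays below
    in roughly `quantile` of simulated trials (assumed cutoff)."""
    if not 0 < num_retrieved <= num_docs:
        raise ValueError("num_retrieved must be between 1 and num_docs")
    if not 0 <= num_relevant <= num_docs:
        raise ValueError("num_relevant must be between 0 and num_docs")
    rng = random.Random(seed)
    # Under random retrieval it does not matter which ids are the relevant ones.
    relevant = set(range(num_relevant))
    precisions = []
    for _ in range(trials):
        retrieved = rng.sample(range(num_docs), num_retrieved)
        hits = sum(1 for doc in retrieved if doc in relevant)
        precisions.append(hits / num_retrieved)
    precisions.sort()
    index = min(trials - 1, int(quantile * trials))
    return precisions[index]


# Hypothetical query: 20 relevant documents among 1000, 10 retrieved.
threshold = random_precision_threshold(1000, 20, 10, trials=2000)
```

A retrieval result whose precision is above `threshold` would, under these assumptions, be unlikely to arise from random selection. The same loop works for rank-sensitive measures by replacing the precision computation.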

“…Retrieval by titles, abstracts, and subject headings, in Compendex, was investigated by Byrne (1975). Controlled (subject headings) and uncontrolled subject descriptions (word-stems from titles and abstracts) produce similar levels of performance in retrieval and are thus complementary (Shaw Jr, 1993). Jenuwine and Floyd (2004) also concluded that MeSH descriptors and text-words should be used together for maximal retrieval.…”

Section: Introduction (mentioning)

confidence: 99%

Non‐agricultural databases and thesauri

Bartol

2012

Program

Purpose: The paper aims to assess the utility of non-agriculture-specific information systems, databases, and respective controlled vocabularies (thesauri) in organising and retrieving agricultural information. The purpose is to identify thesaurus-linked tree structures, controlled subject headings/terms (heading words, descriptors), and principal database-dependent characteristics and assess how controlled terms improve retrieval results (recall) in relation to free-text/uncontrolled terms in abstracts and document titles.
Design/methodology/approach: Several different hosts (interfaces, platforms, portals) and databases were used: CSA Illumina (ERIC, LISA), Ebscohost (Academic Search Complete, Medline, Political Science Complete), Ei-Engineering Village (Compendex, Inspec), OVID (PsycINFO), ProQuest (ABI/Inform Global). The search-terms agriculture and agricultural and truncated word-stem agricultur- were employed. Permuted (rotated index) search fields were used to retrieve terms from thesauri. Subject-heading search was assessed in relation to free-text search, based on abstracts and document titles.
Findings: All thesauri contain agriculture-based headings; however, associative, hierarchical and synonymous relationships show important inter-database differences. Using subject headings along with abstracts and titles in search syntax (query) sometimes improves retrieval by up to 60 per cent. Retrieval depends on search fields and database-specifics, such as autostemming (lemmatization), explode function, word-indexing, or phrase-indexing.
Research limitations/implications: Inter-database and host comparison, on consistent principles, can be limited because of some particular host- and database-specifics.
Practical implications: End-users may exploit databases more competently and thus achieve better retrieval results in searching for agriculture-related information.
Originality/value: The function of as many as ten databases in different disciplines in providing information relevant to subject matter that is not a topical focus of databases is assessed.
