“…Since the adoption of manual processes for subject indexing with controlled vocabularies in the 1950s and 1960s, many researchers have explored techniques to automate this process (Borko; Klingbiel; Stevens & Urban). Over the years, the research agenda has remained remarkably stable.…”
Relationships between terms and features are an essential component of thesauri, ontologies, and a range of controlled vocabularies. In this article, we describe ways to identify important concepts in documents using the relationships in a thesaurus or other vocabulary structures. We introduce a methodology for the analysis and modeling of the indexing process based on a weighted random walk algorithm. We also introduce a thesaurus-centric matching algorithm intended to improve the quality of candidate concepts. In all cases, the weighted random walk improves automatic indexing performance over matching alone, with an increase in average precision (AP) of 9% for HEP, 11% for MeSH, 35% for NALT, and 37% for AGROVOC. The results of the analysis support our hypothesis that subject indexing is in part a browsing process, and that using the vocabulary and its structure in a thesaurus contributes to the indexing process. The amount that the vocabulary structure contributes was found to differ among the 4 thesauri, possibly due to the vocabulary used in the corresponding thesauri and the structural relationships between the terms. Each thesaurus, and the manual indexing associated with it, is characterized using the methods developed here.
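The abstract above does not spell out the algorithm's details. As a rough illustration of the general idea only, the sketch below runs a weighted random walk with restart over a tiny invented thesaurus fragment, seeding the walk from concepts matched in a document; visit counts then rank concepts by importance. All terms, weights, and parameters here are hypothetical and are not the authors' actual method or data.

```python
import random
from collections import Counter

# Hypothetical thesaurus fragment: term -> list of (related_term, edge_weight).
# Edge weights stand in for relationship strength (e.g. broader/narrower
# vs. related-term links); the values are invented for illustration.
THESAURUS = {
    "maize":          [("cereals", 1.0), ("corn oil", 0.5)],
    "cereals":        [("maize", 1.0), ("wheat", 1.0), ("crops", 1.0)],
    "wheat":          [("cereals", 1.0)],
    "corn oil":       [("maize", 0.5), ("vegetable oils", 1.0)],
    "vegetable oils": [("corn oil", 1.0)],
    "crops":          [("cereals", 1.0)],
}

def weighted_random_walk(seeds, steps=10000, restart=0.15, rng=None):
    """Random walk with restart: from the current concept, either jump back
    to a matched seed concept (with probability `restart`) or follow an
    outgoing relationship chosen in proportion to its weight.  Visit counts
    approximate the importance of each concept for indexing."""
    rng = rng or random.Random(0)
    visits = Counter()
    current = rng.choice(seeds)
    for _ in range(steps):
        visits[current] += 1
        if rng.random() < restart or not THESAURUS.get(current):
            current = rng.choice(seeds)
        else:
            neighbours = THESAURUS[current]
            total = sum(w for _, w in neighbours)
            r = rng.random() * total
            for term, w in neighbours:
                r -= w
                if r <= 0:
                    current = term
                    break
    return visits

# The seeds would come from matching document text against thesaurus terms.
ranking = weighted_random_walk(["maize", "wheat"])
print([term for term, _ in ranking.most_common(3)])
```

Note how a hub concept such as "cereals" can rank highly even if it never matched the document text directly, which is the browsing-like behavior the abstract describes.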
“…The citations in this article are illustrative and not comprehensive. The research tends to focus on several contexts for evaluation, including comparison of human and machine classification practices [5, 6, 7, 8, 9, 10]; assessment of the variability of classification decisions among human and machine classifiers [11, 12, 13]; comparison of machine‐generated classification structures and well‐established classification schemes and thesauri [14, 15, 16, 17, 18]; the quality of classification in the context of information retrieval [19, 20, 21]; and evaluations of the quality of statistically generated classes [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]. …”
Stacy Surla is the Bulletin's associate editor for IA. She serves on the IA Institute Board of Directors and is a past chair of the IA Summit.

This article considers how evaluation pertains to taxonomies. Taxonomies and evaluation are both rich concepts, so it is best to start out with some definitions that help to define our discussion. What do we mean by taxonomy? And what do we mean by evaluation?
Taxonomies

For seasoned information professionals the traditional characterization of a taxonomy is as a hierarchical classification scheme. This characterization has expanded in the last 20 years as the taxonomy community and the information environment have expanded. Today the taxonomy community includes people who design taxonomies, those who build systems that support them and those who use them. Our complex information environment may call for a variety of taxonomic structures, including:

- flat taxonomies, such as lists of languages or lists of countries;
- hierarchical taxonomies, such as topical or subject classifications, business classifications or service classifications;
- faceted taxonomies, such as metadata or parametric search structures;
- ring taxonomies, such as synonyms or authority control data; and
- network taxonomies, such as fully relationed thesauri or knowledge networks.

Each of these structures has its own set of principles and behaviors, and each requires an evaluation method that aligns with those principles and behaviors. This article focuses on the second type of taxonomy: the traditional classification scheme, or hierarchical taxonomy. Classification schemes govern the organization of objects into groups according to explicit properties or values. They are in widespread use in everyday life, from grocery stores to websites to personal information spaces.
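To make the hierarchical case concrete, here is a minimal sketch of a classification scheme that groups objects by explicit property values, as described above. Every class name and property in it is invented for illustration; real schemes are far richer.

```python
# A toy hierarchical classification scheme.  Each class may declare the
# explicit property values that qualify an object for membership,
# echoing the idea that classification schemes group objects
# "according to explicit properties or values".
TAXONOMY = {
    "Products": {
        "Groceries": {
            "Produce": {"properties": {"perishable": True, "edible": True}},
            "Canned goods": {"properties": {"perishable": False, "edible": True}},
        },
        "Household": {
            "Cleaning": {"properties": {"edible": False}},
        },
    },
}

def classify(obj, tree, path=()):
    """Depth-first search for the deepest class whose declared
    properties are all satisfied by the object's properties."""
    best = path
    for name, node in tree.items():
        if name == "properties":
            continue  # metadata key, not a subclass
        props = node.get("properties", {})
        if all(obj.get(k) == v for k, v in props.items()):
            candidate = classify(obj, node, path + (name,))
            if len(candidate) > len(best):
                best = candidate
    return best

# A can of beans: not perishable, edible.
print(classify({"perishable": False, "edible": True}, TAXONOMY))
```

The deepest-match rule here is one possible policy; evaluating a real scheme would also have to ask how it behaves when an object satisfies no class, or several, which is exactly the goodness-of-fit question raised elsewhere in this section.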
Special Section

EDITOR'S SUMMARY

Direction on the construction and application of classification schemes such as taxonomies is readily available, but relatively little has been offered on evaluating the schemes themselves and their use to categorize content. A classification scheme can be judged for how well it meets its purpose and complies with standards, and a strong evaluative framework is reflected in S.R. Ranganathan's principles of classification. The degree of certainty of classification decisions depends on objective understanding of the object to be classified, the scope and details of the class and the coverage and organization of the overall classification scheme. The more complete the information about each class, the more reliable the goodness-of-fit for an object to a class is likely to be, whether chosen by human or machine classifiers. This information comes through definitions, examples, prior use and semantic relationships. The risk of misclassification can be reduced by analyzing the goodness-of-fit of objects to classes and the patterns of missed or erroneous selections.
“…In 1960, Borko [8] used the principle of factor analysis to develop clusters for a 90 × 90 correlation matrix. Stiles and Salisbury [42] have utilized a so-called B-coefficient to subdivide term profiles into distinct sets.…”
Section: Related Work in Cluster Analysis
ABSTRACT. Several graph theoretic cluster techniques aimed at the automatic generation of thesauri for information retrieval systems are explored. Experimental cluster analysis is performed on a sample corpus of 2267 documents. A term-term similarity matrix is constructed for the 3950 unique terms used to index the documents. Various threshold values, T, are applied to the similarity matrix to provide a series of binary threshold matrices. The corresponding graph of each binary threshold matrix is used to obtain the term clusters.

Three definitions of a cluster are analyzed: (1) the connected components of the threshold matrix; (2) the maximal complete subgraphs of the connected components of the threshold matrix; (3) clusters of the maximal complete subgraphs of the threshold matrix, as described by Gotlieb and Kumar. Algorithms are described and analyzed for obtaining each cluster type. The algorithms are designed to be useful for large document and index collections. Two algorithms have been tested that find maximal complete subgraphs. An algorithm developed by Bierstone offers a significant time improvement over one suggested by Bonner.

For threshold levels T > 0.6, basically the same clusters are developed regardless of the cluster definition used. In such situations one need only find the connected components of the graph to develop the clusters.
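The first cluster definition above, connected components of the threshold graph, can be sketched in a few lines. The similarity matrix below is invented for illustration and is, of course, far smaller than the 3950-term matrix used in the study.

```python
def threshold_components(sim, T):
    """Apply threshold T to a symmetric term-term similarity matrix,
    then return the connected components of the resulting binary
    threshold graph (cluster definition 1), using union-find."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > T:              # edge in the threshold graph
                parent[find(i)] = find(j)  # union the two components

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Toy 4-term similarity matrix: terms 0-1 and 2-3 are strongly related.
sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.0, 0.7, 1.0],
]
print(threshold_components(sim, T=0.6))  # two clusters at this threshold
```

Definitions (2) and (3) would instead require enumerating maximal complete subgraphs (cliques), which is considerably more expensive; the abstract's closing observation is that above T > 0.6 the cheap connected-components computation suffices.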