Enhancing Search and Browse Using Automated Clustering of Subject Metadata

Hagedorn, Kat; Chapman, Suzanne; Newman, David J.

doi:10.1045/july2007-hagedorn

Cited by 5 publications

(3 citation statements)

References 8 publications


“…This protocol implements a standardized metadata model for facilitating exchange between repositories. Approaches to document clustering in digital libraries have focused, among other things, on extending search queries and metadata entries of documents (Hagedorn et al., 2007; Rosenberg and Borgman, 1992). In this case, clustering is performed to detect the subject area of documents based on a predefined classification scheme, that is, a closed topic model (Newman et al., 2007).…”

Section: Introduction

mentioning

confidence: 99%

Enhancing document modeling by means of open topic models

Mehler

Waltinger

2009

Library Hi Tech


We present a topic classification model using the Dewey Decimal Classification (DDC) as the target scheme. This is done by exploring metadata as provided by the Open Archives Initiative (OAI) to derive document snippets as minimal document representations. The reason is to reduce the effort of document processing in digital libraries. Further, we perform feature selection and extension by means of social ontologies and related web-based lexical resources. This is done to provide reliable topic-related classifications while circumventing the problem of data sparseness. Finally, we evaluate our model by means of two language-specific corpora. This paper bridges digital libraries on the one hand and computational linguistics on the other. The aim is to make accessible computational linguistic methods to provide thematic classifications in digital libraries based on closed topic models as the DDC.


“…Examples of unsupervised learning approaches include Krowne and Halbert (2005), who used a text-clustering approach to analyze the title, description and subject fields from the "americansouth.org" digital library, and Newman et al (2007) and Hagedorn et al (2007), who used a statistical topic model to enrich subject metadata in 7.5 million records in the OAIster Digital Library. Recently, Tuarob et al (2013) described a method for generating tags from a domain-specific controlled vocabulary to augment metadata for resources from four different environmental data repositories associated with the DataONE program.…”

mentioning

confidence: 99%

Augmenting Dublin Core digital library metadata with Dewey Decimal Classification

Khoo

Ahn

Binding

et al. 2015

Journal of Documentation


Purpose – The purpose of this paper is to describe a new approach to a well-known problem for digital libraries, how to search across multiple unrelated libraries with a single query.

Design/methodology/approach – The approach involves creating new Dewey Decimal Classification terms and numbers from existing Dublin Core records. In total, 263,550 records were harvested from three digital libraries. Weighted key terms were extracted from the title, description and subject fields of each record. Ranked DDC classes were automatically generated from these key terms by considering DDC hierarchies via a series of filtering and aggregation stages. A mean reciprocal ranking evaluation compared a sample of 49 generated classes against DDC classes created by a trained librarian for the same records.

Findings – The best results combined weighted key terms from the title, description and subject fields. Performance declines with increased specificity of DDC level. The results compare favorably with similar studies.

Research limitations/implications – The metadata harvest required manual intervention and the evaluation was resource intensive. Future research will look at evaluation methodologies that take account of issues of consistency and ecological validity.

Practical implications – The method does not require training data and is easily scalable. The pipeline can be customized for individual use cases, for example, recall or precision enhancing.

Social implications – The approach can provide centralized access to information from multiple domains currently provided by individual digital libraries.

Originality/value – The approach addresses metadata normalization in the context of web resources. The automatic classification approach accounts for matches within hierarchies, aggregating lower level matches to broader parents and thus approximates the practices of a human cataloger.
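The abstract above mentions a mean reciprocal ranking evaluation but does not spell out the computation. As a rough illustration only, the standard mean reciprocal rank (MRR) can be sketched as follows. The function names, the three-digit class codes, and the sample data are hypothetical and are not taken from the paper; the authors' actual evaluation procedure may differ.

```python
from typing import Hashable, Sequence


def reciprocal_rank(ranked: Sequence[Hashable], gold: Hashable) -> float:
    """Return 1/position of the gold label in a ranked list, or 0.0 if absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    ranked_lists: Sequence[Sequence[Hashable]],
    gold_labels: Sequence[Hashable],
) -> float:
    """Average the reciprocal rank over all evaluated records."""
    if len(ranked_lists) != len(gold_labels):
        raise ValueError("need exactly one gold label per ranked list")
    if not ranked_lists:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, gold)
        for ranked, gold in zip(ranked_lists, gold_labels)
    )
    return total / len(ranked_lists)


# Hypothetical data: generated class rankings for three records,
# and the class a librarian assigned to each record.
generated = [
    ["510", "004", "020"],  # gold class at rank 2 -> 1/2
    ["020", "025"],         # gold class at rank 1 -> 1
    ["500", "600"],         # gold class missing   -> 0
]
librarian = ["004", "020", "700"]

print(mean_reciprocal_rank(generated, librarian))  # 0.5
```

A score of 1.0 would mean the librarian's class was always ranked first; lower scores mean it appeared further down the generated ranking or not at all.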


“…With various information retrieval (IR) and text categorization (TC, also known as automatic classification) models becoming more and more available for DLs and generating local demand for new, automated solutions [23], in this paper we test a new TC model in a real world setting for the above purpose. As TC research typically uses standard test collections of documents, we replace them by a small database, the institutional repository of the University of Strathclyde, Glasgow, called Strathprints, indexed by the Library of Congress Subject Headings (LCSH).…”

Section: Introduction

mentioning

confidence: 99%

Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints

2012


Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by using standard test collections. In this paper we present a pilot experiment of replacing such test collections by a set of 6000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels on classification reconstruction from abstracts and vice versa from full-text documents, the latter outcome due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of the persistent archives and the less studied one (compared to representations of digital objects and collections). Our research is an initial step in this direction developing further the methodological approach and demonstrating that text categorisation can be applied to analyse the thematic coverage in digital repositories.
