The Library of Congress Classification as a Knowledge Base for Automatic Subject Categorization

Godby, Carol Jean; Stuler, Jay

doi:10.1515/9783110964912.163

Cited by 6 publications

(5 citation statements)

References 0 publications


“…(LCSH), Dewey Decimal Classification (DDC), and Universal Decimal Classification (UDC) [20][21][22]. Existing document classification schemes are well established and lead to effective manual document classification.…”

Section: Introduction (mentioning)

confidence: 99%

“…The GERHARD project used the UDC scheme, while the Scorpion project employed the DDC scheme [23]. An experimental automatic document classification system was also built using the LCC scheme [22]. However, none of these document classification systems was constructed based on machine learning techniques.…”

Section: Introduction (mentioning)

confidence: 99%

“…Comparatively speaking, the networked computing environment and the methods for electronic document description and organization are still evolving [20]. In the literature of library and information science, the need to combine electronic documents with traditional library materials has inspired continuous discussions on the refinement of existing manual classification schemes such as the Library of Congress Classification (LCC), Library of Congress Subject Headings (LCSH), Dewey Decimal Classification (DDC), and Universal Decimal Classification (UDC) [20][21][22]. Existing document classification schemes are well established and lead to effective manual document classification.…”

Section: Introduction (mentioning)

confidence: 99%


A comparative study of two automatic document classification methods in a library setting

Pong, Kwok, Lau, et al. 2007

Journal of Information Science


In current library practice, trained human experts usually carry out document cataloguing and indexing based on a manual approach. With the explosive growth in the number of electronic documents available on the Internet and digital libraries, it is increasingly difficult for library practitioners to categorize both electronic documents and traditional library materials using just a manual approach. To improve the effectiveness and efficiency of document categorization in a library setting, more in-depth studies of using automatic document classification methods to categorize library items are required. Machine learning research has advanced rapidly in recent years. However, applying machine learning techniques to improve library practice is still a relatively unexplored area. This paper illustrates the design and development of a machine learning based automatic document classification system to alleviate the manual categorization problem encountered within the library setting. Two supervised machine learning algorithms have been tested. Our empirical tests show that supervised machine learning algorithms in general, and the k-nearest neighbours (KNN) algorithm in particular, can be used to develop an effective document classification system to enhance current library practice. Moreover, some concrete recommendations regarding how to practically apply the KNN algorithm to develop automatic document classification in a library setting are made. To the best of our knowledge, this is the first in-depth study of applying the KNN algorithm to automatic document classification based on the widely used LCC classification scheme adopted by many large libraries.
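As a rough illustration of the KNN idea mentioned in this abstract, the sketch below classifies a title by majority vote among its most similar labelled titles. It is not the authors' system: the bag-of-words features, the cosine measure, the toy training titles, and the names `vectorize`, `cosine`, and `knn_classify` are all assumptions made for illustration.

```python
from collections import Counter
import math


def vectorize(text):
    """Lowercased bag-of-words term counts (an assumed, very simple feature set)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Return the majority label among the k training items most similar to query."""
    q = vectorize(query)
    ranked = sorted(
        training,
        key=lambda pair: cosine(q, vectorize(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy (title, LCC class letter) pairs, invented for illustration only.
TRAINING = [
    ("introduction to organic chemistry", "Q"),
    ("quantum physics and chemistry", "Q"),
    ("clinical medicine handbook", "R"),
    ("internal medicine and surgery", "R"),
    ("soil science and agriculture", "S"),
]

if __name__ == "__main__":
    print(knn_classify("physical chemistry textbook", TRAINING, k=3))
```

With a training set this small, ties in similarity are broken by input order, so a real system would need far more data and a considered choice of `k`.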


“…Scorpion is a program developed by the OCLC Project to assign DDC to Web resources and other full‐text documents that has been adapted to use the LCC (Thompson et al, 1997; Godby & Stuler, 2003). Like Larson's system, it works by creating virtual documents representing each of the possible classifications and using information retrieval measures to compare new examples to the virtual documents.…”

Section: Introduction (mentioning)

confidence: 99%

“…First, the virtual documents representing possible classifications are not created by clustering similarly classified documents together, instead they are generated from the LCC hierarchy. Starting with the full LCC, those classifications whose textual descriptions contain country names or generic names, or cross‐references to other classifications are removed: in experiments with the Q, R, S, and T schedules 91% of the classifications are eliminated, leaving 6,314 classifications (Godby & Stuler, 2003). Second, virtual documents for each LCC are derived by selecting co‐occurring terms from OCLC's WorldCat (a database of bibliographic records) and from the Library of Congress Subject Authority (a database of canonical names and terms).…”

Section: Introduction (mentioning)

confidence: 99%
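
The two citation statements above describe a pipeline: filter the classification captions, build one virtual document per remaining class, then rank classes against a new text with an information retrieval measure. The following is a minimal sketch of that idea under stated assumptions, not Scorpion's actual implementation: the class ids, captions, country list, cross-reference marker ("see"), toy records, and the function names are all hypothetical, and cosine similarity stands in for whichever measure the real system uses.

```python
from collections import Counter
import math

# Hypothetical class ids and captions; not real LCC entries.
CAPTIONS = {
    "Q-1": "Organic chemistry",
    "Q-2": "Analytical chemistry",
    "Q-3": "Geology of France",             # contains a country name
    "R-1": "Immunologic diseases see Q-9",  # contains a cross-reference
    "T-1": "Electronics",
}
COUNTRY_NAMES = {"france", "germany", "japan"}  # assumed tiny stop list

# Toy (class id, co-occurring terms) pairs standing in for catalog records.
RECORDS = [
    ("Q-1", "carbon compounds synthesis"),
    ("Q-2", "titration spectroscopy assay"),
    ("T-1", "circuits semiconductors transistors"),
]


def keep(caption):
    """Drop captions with a country name or a 'see' cross-reference."""
    words = set(caption.lower().split())
    return not (words & COUNTRY_NAMES) and "see" not in words


def build_virtual_docs(captions, records):
    """One term-count vector per surviving class: caption plus record terms."""
    docs = {
        cid: Counter(cap.lower().split())
        for cid, cap in captions.items()
        if keep(cap)
    }
    for cid, terms in records:
        if cid in docs:
            docs[cid].update(terms.lower().split())
    return docs


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_classes(text, docs):
    """Return (class id, score) pairs, best match first."""
    query = Counter(text.lower().split())
    scores = [(cid, cosine(query, vec)) for cid, vec in docs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = build_virtual_docs(CAPTIONS, RECORDS)
    print(rank_classes("synthesis of carbon compounds", docs)[0])
```

Because the virtual documents come from captions and co-occurring terms rather than from clustered training documents, a class can be ranked even when no labelled example of it exists.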

Predicting Library of Congress classifications from Library of Congress subject headings

Frank, Paynter 2003

J. Am. Soc. Inf. Sci.


This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to a work given its set of Library of Congress Subject Headings (LCSH). LCCs are organized in a tree: The root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a model that maps from sets of LCSH to classifications from the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.

Introduction

The Library of Congress Classification (LCC) is a hierarchical set of topic descriptors used to categorize the intellectual content of a work, to situate the work relative to others in the tree of knowledge, and (more prosaically) to place books on shelves. Because LCC are media-independent, they can be assigned to resources in digital and virtual libraries, providing compatibility with traditional resources and an access method familiar to librarians from the United States and around the world.

INFOMINE (http://infomine.ucr.edu/) is a virtual library of over 20,000 scholarly Internet resources maintained cooperatively by and for librarians. Each record was manually created by a librarian, and describes a resource with standard library cataloging techniques, including controlled terms from the Library of Congress Subject Headings (LCSH). The INFOMINE Project requires a hierarchical classification for each resource, but in a collection this large (and growing) it is logistically impossible to assign such metadata manually.

This lack defines our problem: we wish to automatically assign a hierarchical classification to each INFOMINE record based on its extant metadata. Specifically, we will learn to assign a classification from the LCC Outline to a resource based on a set of LCSH that describe that resource. The solution we describe uses machine learning techniques and training data from an academic library catalog to learn a classification model that maps from sets of LCSH to nodes in the LCC tree.

The problem is complicated by the large number of potential classifications: most machine learning problems deal with hundreds of classes at most, but there are thousands of LCC. For this reason, prior work treats LCC classification as an information retrieval task: virtual documents are created representing each class, and new examples are classified by using similarity measures to find the most similar "documents" (Larson, 1992; Dolin, 1998; Thompson, Shafer, & Vizine-Goetz, 1997). The hierarchical nature of the LCC is largely ignored.

Our solution addresses the problem by exploiting its hierarchical nature. A pairwise linear classifier is learned for every node in the LCC hierarchy that classifies an example as belonging to that node or belonging to one of it...
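
The abstract is cut off before it finishes describing the per-node classifiers, so the sketch below does not reproduce the authors' method. It only illustrates the general idea of placing a resource by descending a class tree one level at a time. The tiny tree, the keyword sets, and the names `score` and `classify` are assumptions, and a plain keyword-overlap count is swapped in where the paper uses learned linear classifiers.

```python
# Tiny slice of a class tree: node -> list of child nodes.
TREE = {
    "ROOT": ["Q", "T"],
    "Q": ["QA", "QD"],
    "T": ["TK"],
    "QA": [],
    "QD": [],
    "TK": [],
}

# Assumed keyword sets per node; a learned model would replace these.
KEYWORDS = {
    "Q": {"science", "mathematics", "algebra", "chemistry"},
    "T": {"technology", "engineering", "electronics"},
    "QA": {"mathematics", "algebra"},
    "QD": {"chemistry"},
    "TK": {"engineering", "electronics"},
}


def score(node, terms):
    """Stand-in scorer: number of node keywords present in the terms."""
    return len(KEYWORDS[node] & terms)


def classify(headings, tree, root="ROOT"):
    """Descend from the root, following the best-scoring child at each level.

    Returns the path of chosen nodes; stops early when no child matches.
    """
    terms = {word for heading in headings for word in heading.lower().split()}
    path = []
    node = root
    while tree[node]:
        best = max(tree[node], key=lambda child: score(child, terms))
        if score(best, terms) == 0:
            break  # no child fits: keep the resource at the current node
        path.append(best)
        node = best
    return path


if __name__ == "__main__":
    print(classify(["Chemistry", "Organic compounds"], TREE))
```

Descending level by level means each decision chooses among a handful of siblings instead of thousands of classes, which is the practical appeal of exploiting the hierarchy.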


An extensive study on automated Dewey Decimal Classification

Wang 2009

J. Am. Soc. Inf. Sci.


In this paper, we present a theoretical analysis and extensive experiments on the automated assignment of Dewey Decimal Classification (DDC) classes to bibliographic data with a supervised machine-learning approach. Library classification systems, such as the DDC, impose great obstacles on state-of-the-art text categorization (TC) technologies, including deep hierarchy, data sparseness, and skewed distribution. We first statistically analyze the document and category distributions over the DDC, and discuss the obstacles imposed by bibliographic corpora and library classification schemes on TC technology. To overcome these obstacles, we propose an innovative algorithm to reshape the DDC structure into a balanced virtual tree by balancing the category distribution and flattening the hierarchy. To improve the classification effectiveness to a level acceptable to real-world applications, we propose an interactive classification model that is able to predict a class of any depth within a limited number of user interactions. The experiments are conducted on a large bibliographic collection created by the Library of Congress within the science and technology domains over 10 years. With no more than three interactions, a classification accuracy of nearly 90% is achieved, thus providing a practical solution to the automatic bibliographic classification problem.
