This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to a work given its set of Library of Congress Subject Headings (LCSH). LCCs are organized in a tree: the root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a model that maps from sets of LCSH to classifications from the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.
Introduction

The Library of Congress Classification (LCC) is a hierarchical set of topic descriptors used to categorize the intellectual content of a work, to situate the work relative to others in the tree of knowledge, and (more prosaically) to place books on shelves. Because LCC¹ are media-independent, they can be assigned to resources in digital and virtual libraries, providing compatibility with traditional resources and an access method familiar to librarians from the United States and around the world.

INFOMINE (http://infomine.ucr.edu/) is a virtual library of over 20,000 scholarly Internet resources maintained cooperatively by and for librarians. Each record was manually created by a librarian, and describes a resource with standard library cataloging techniques, including controlled terms from the Library of Congress Subject Headings (LCSH). The INFOMINE Project requires a hierarchical classification for each resource, but in a collection this large (and growing) it is logistically impossible to assign such metadata manually.

This lack defines our problem: we wish to automatically assign a hierarchical classification to each INFOMINE record based on its extant metadata. Specifically, we will learn to assign a classification from the LCC Outline to a resource based on a set of LCSH that describe that resource. The solution we describe uses machine learning techniques and training data from an academic library catalog to learn a classification model that maps from sets of LCSH to nodes in the LCC tree.

The problem is complicated by the large number of potential classifications: most machine learning problems deal with hundreds of classes at most, but there are thousands of LCC.
For this reason, prior work treats LCC classification as an information retrieval task: virtual documents are created representing each class, and new examples are classified by using similarity measures to find the most similar "documents" (Larson, 1992; Dolin, 1998; Thompson, Shafer, & Vizine-Goetz, 1997). The hierarchical nature of the LCC is largely ignored.

Our solution addresses the problem by exploiting its hierarchical nature. A pairwise linear classifier is learned for every node in the LCC hierarchy that classifies an example as belonging to that node or belonging to one of it...