A major barrier to successful retrieval from external sources (e.g., electronic databases) is the tremendous variability in the words that people use to describe objects of interest. The fact that different authors use different words to describe essentially the same idea means that relevant objects will be missed; conversely, the fact that the same word can be used to refer to many different things means that irrelevant objects will be retrieved. We describe a statistical method called latent semantic indexing, which models the implicit higher order structure in the association of words and objects and improves retrieval performance by up to 30%. Additional large performance improvements of 40% and 67% can be achieved through the use of differential term weighting and iterative retrieval methods.

Although much research in cognitive psychology has been devoted to the question of how people retrieve information from their own memories, much less work has been done on the issue of the retrieval of information from external sources such as other people, books, libraries, or electronic databases. One problem that is immediately evident in attempting to retrieve information from external sources is the mismatch between the searcher's language and that of the target information. How often have you looked in the index of a book or the library catalog and been unable to find what you wanted? This problem is not evident in traditional memory modeling, because memory probes are in the same language as the memory representation.

Most approaches to the retrieval of electronically available textual materials depend on a lexical match between words in users' requests and those in database objects. Typically only text objects that contain one or more words in common with those in the user's query are returned as relevant. 
Word-based retrieval systems like this are, however, far from ideal: many objects relevant to a user's query are missed, and many unrelated or irrelevant materials are retrieved. A particularly salient example of the failure to find relevant materials is reported by Blair and Maron (1985) in a study of a state-of-the-art on-line legal retrieval system. Two lawyers, with the aid of an expert search intermediary, searched the database for all materials relevant to a case they were litigating. The system contained the full text of 40,000 documents, corresponding to roughly 350,000 pages of text. The lawyers were asked to search until they thought they had found 75% of the relevant materials. The surprising result was that they found only 20% of the known relevant materials.

Correspondence should be addressed to Susan T. Dumais, Bellcore, 445 South St., Room 2L-371, Morristown, NJ 07962-1910.
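
The lexical-match scheme described above, and the two ways it fails, can be illustrated with a minimal sketch. It is not the system studied by Blair and Maron (1985) or any implementation from this paper; the function names (`tokenize`, `lexical_match`, `recall`) and the four-document collection are hypothetical and were invented only for illustration.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def lexical_match(query, documents):
    """Return ids of documents that share at least one word with the query."""
    query_words = tokenize(query)
    return [doc_id for doc_id, text in documents.items()
            if query_words & tokenize(text)]

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

# Hypothetical toy collection (an assumption, not data from the paper).
documents = {
    "d1": "car maintenance and repair",
    "d2": "automobile engine service",
    "d3": "bank of the river",
    "d4": "bank loan rates",
}

# Synonymy: d2 is about the same topic but shares no word with the query,
# so it is missed and recall drops.
car_hits = lexical_match("car repair", documents)
car_recall = recall(car_hits, {"d1", "d2"})

# Polysemy: "bank" matches both senses, so the unrelated d4 is retrieved.
river_hits = lexical_match("river bank", documents)
```

In this toy collection the first query misses `d2` because the author wrote "automobile" rather than "car", and the second query returns `d4` because "bank" carries an unrelated meaning there. These are the same two failure modes, missed relevant objects and retrieved irrelevant ones, that the text attributes to word-based retrieval.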
We believe that fundamental characteristics of human verbal behavior underlie these retrieval difficulties. Furnas, Landauer, Gomez, and Dumais (1987), for example, have shown that people generate the same main keyword to describe well-known objects less than 20% of the time. Comparably poor agreement has been reported in stu...