Abstract: A major barrier to successful retrieval from external sources (e.g., electronic databases) is the tremendous variability in the words that people use to describe objects of interest. The fact that different authors use different words to describe essentially the same idea means that relevant objects will be missed; conversely, the fact that the same word can be used to refer to many different things means that irrelevant objects will be retrieved. We describe a statistical method called latent semantic indexin…
“…A brief overview of LSA will be provided here. More complete descriptions of LSA may be found in Deerwester, Dumais, Furnas, Landauer, & Harshman (1990) and Dumais (1990).…”
Latent semantic analysis (LSA) is a statistical model of word usage that permits comparisons of semantic similarity between pieces of textual information. This paper summarizes three experiments that illustrate how LSA may be used in text-based research. Two experiments describe methods for analyzing a subject's essay, both to determine from which text the subject learned the information and to grade the quality of the information cited in the essay. The third experiment describes using LSA to measure the coherence and comprehensibility of texts.

One of the primary goals in text-comprehension research is to understand what factors influence a reader's ability to extract and retain information from textual material. The typical approach is to have subjects read textual material and then produce some form of summary, such as answering questions or writing an essay. This summary permits the experimenter to determine what information the subject has gained from the text.

To analyze what a subject has learned from a text, the experimenter must relate what was in the summary to what the subject has read. This permits the subject's representation (cognitive model) of the text to be compared with the representation expressed in the original text. For such an analysis, the experimenter must examine each sentence in the subject's summary and match the information it contains to the information contained in the texts that were read. Information in the summary that is highly related to information from the texts was likely learned from those texts. However, matching this information is not easy: it requires scanning through the original texts to locate the information, and since subjects do not write exactly the same words as those that they have read, it is not possible to look for exact matches. Instead, the experimenter must make the match on the basis of the semantic content of the text. This work has benefited from collaborative research with
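The matching procedure described above can be sketched in code: represent each document in a reduced latent space via a truncated SVD of a term-by-document matrix, then compare a summary sentence to each source text by cosine similarity. This is only a minimal illustration of the general LSA approach, not the paper's actual procedure; the toy corpus and the choice of k = 2 latent dimensions are assumptions for the sake of the example.

```python
import numpy as np

# Toy corpus: two "source texts" and one "summary sentence".
# In a real study the corpus and vocabulary would be far larger.
docs = [
    "the cat sat on the mat",                   # source text 0
    "stocks fell as markets reacted to news",   # source text 1
    "a cat rested on a mat",                    # summary sentence
]

# Build a term-by-document count matrix A (terms as rows, documents as columns).
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD: keep k latent dimensions (k chosen arbitrarily here).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # each row: one document in latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# The summary shares no exact words with source 1 but overlaps source 0
# semantically (cat, on, mat), so its latent vector lies closer to source 0.
print(cosine(doc_vecs[2], doc_vecs[0]), cosine(doc_vecs[2], doc_vecs[1]))
```

Because similarity is computed in the latent space rather than on raw word overlap, near-synonymous sentences can match even without identical wording, which is exactly the property the matching task requires.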
Automatic categorization of text documents has become an important area of research in the last two decades, with features that make it significantly more difficult than the traditional classification tasks studied in machine learning. A more recent development is the need to classify hypertext documents, most notably web pages. These have features that add further complexity to the categorization task but also offer the possibility of using information that is not available in standard text classification, such as metadata and the content of the web pages that point to and are pointed at by a web page of interest. This chapter surveys the state of the art in text categorization and hypertext categorization, focussing particularly on issues of representation that differentiate them from 'conventional' classification tasks and from each other.
“…This system is called Latent Semantic Indexing (LSI) [Dum91] and was the product of Susan Dumais, then at Bell Labs. LSI simply creates a low-rank approximation A_k to the term-by-document matrix A from the vector space model.…”
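The low-rank approximation A_k mentioned in this snippet can be sketched with a truncated SVD. The matrix below is a small hypothetical term-by-document example (values are illustrative, not from any cited work); by the Eckart-Young theorem, truncating the SVD gives the best rank-k approximation in the Frobenius norm.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (illustrative values only).
A = np.array([
    [1., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 1., 1., 1.],
    [0., 0., 1., 1.],
    [1., 0., 0., 1.],
])

# Rank-k approximation A_k from the truncated SVD: keep the k largest
# singular values and their singular vectors, discard the rest.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# The Frobenius-norm error of A_k equals the energy in the discarded
# singular values: ||A - A_k||_F = sqrt(s[k]^2 + s[k+1]^2 + ...).
err = np.linalg.norm(A - A_k, "fro")
print(np.linalg.matrix_rank(A_k), err)
```

Queries and documents are then compared in this rank-k space, which is what lets LSI retrieve documents that share no exact terms with the query.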
Section: Latent Semantic Indexing
“…[Dum91], [BB05], [BR99], [Ber01], [BDJ99]: LSI is known to outperform the vector space model in terms of precision and recall.…”
Section: [Mey00] If the Term-by-document Matrix A m×n Has the Singular…
“…[BR99], [Ber01], [BB05], [BF96], [BDJ99], [BO98], [Blo99], [BR01], [Dum91], [HB00], [JL00], [JB00], [LB97], [WB98], [ZBR01], [ZMS98]: LSI and the truncated singular value decomposition dominated text mining research in the 1990s.…”