We introduce a new approach to Information Retrieval based on Hidden Markov Models (HMMs). HMMs are shown to provide a mathematically sound framework for retrieving both documents with predefined boundaries and entities of information of arbitrary length and format (passage retrieval). Our retrieval model is shown to encompass promising capabilities. First, the positions at which indexing features occur can be used for indexing. Positional information is essential, for instance, when considering phrases, negation, and the proximity of features. Second, optimal weights for arbitrary features can be derived automatically from training collections. Third, a query-dependent structure can be determined for every document by segmenting the document into passages that are either relevant or irrelevant to the query. The theoretical analysis of our retrieval model is complemented by the results of preliminary experiments.

Introduction

We introduce a new approach to Information Retrieval, i.e. document retrieval and passage retrieval. Documents are considered as being produced by stochastic processes. A first stochastic process generates text fragments that are relevant to a certain query. A second stochastic process generates text fragments independent of any particular query. The generation of text fragments by the two stochastic processes is modeled by means of two Hidden Markov Models (HMMs). Whether one of these two HMMs generates a text fragment with a high probability or with a low probability depends on the distribution of the query features within the text fragment.

In the case of document retrieval, the documents are assigned scores that depend on the ratio of the probability that the document was generated by the first stochastic process to the probability that the document was generated by the second stochastic process. As usual, the documents are presented to the user in decreasing order of their scores.
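The ratio-based document score described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the observation alphabet (1 if a token is a query feature, 0 otherwise), the function names `log_likelihood` and `score`, the model topologies, and all probability values. The score is taken as the logarithm of the ratio, computed with a scaled forward algorithm.

```python
import math

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete HMM."""
    if not obs:
        return 0.0
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p

# Hypothetical "relevant" HMM: state 1 emits query features often.
RELEVANT = dict(
    start=[0.5, 0.5],
    trans=[[0.7, 0.3], [0.3, 0.7]],
    emit=[[0.9, 0.1], [0.4, 0.6]],  # emit[state][symbol]
)
# Hypothetical "general" HMM: one state, query features are rare.
GENERAL = dict(start=[1.0], trans=[[1.0]], emit=[[0.95, 0.05]])

def score(tokens, query):
    """Log ratio of P(doc | relevant HMM) to P(doc | general HMM)."""
    obs = [1 if tok in query else 0 for tok in tokens]
    return log_likelihood(obs, **RELEVANT) - log_likelihood(obs, **GENERAL)

query = {"markov", "retrieval"}
docs = {
    "a": "hidden markov models for retrieval".split(),
    "b": "cooking pasta with tomato sauce".split(),
}
# Present documents in decreasing order of their scores.
ranking = sorted(docs, key=lambda d: score(docs[d], query), reverse=True)
```

Working with the log ratio rather than the raw ratio leaves the ranking unchanged and keeps the arithmetic stable for long documents.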
In the case of passage retrieval, the score of a passage depends on the probability that the passage itself was generated by the first stochastic process and that the text fragments before and after the passage were generated by the second stochastic process.

Three problems are considered difficult in Information Retrieval. First, it is not well known how complex features (e.g. phrases, proximity data, negations, cooccurrence and cocitation data) should be used for indexing. Second, we lack a general weighting scheme for arbitrary indexing features and for arbitrary document collections. Third, the optimal segmentation of a long document into segments that are either relevant or irrelevant to a query is another open problem. Our approach encompasses promising capabilities to solve these three problems at least partially. First, information about positions of features can be conserved, because in our approach a document is considered as being produced by a stochastic process. Conventional retrieval models do not take into account the positions where...