We introduce a new approach to Information Retrieval based on Hidden Markov Models (HMMs). HMMs are shown to provide a mathematically sound framework for retrieving both documents with predefined boundaries and entities of information of arbitrary length and format (passage retrieval). Our retrieval model offers several promising capabilities. First, the positions of occurrences of indexing features can be used for indexing; positional information is essential, for instance, when considering phrases, negation, and the proximity of features. Second, optimal weights for arbitrary features can be derived automatically from training collections. Third, a query-dependent structure can be determined for every document by segmenting it into passages that are either relevant or irrelevant to the query. The theoretical analysis of our retrieval model is complemented by the results of preliminary experiments.
Introduction

We introduce a new approach to Information Retrieval, i.e. document retrieval and passage retrieval. Documents are considered as being produced by stochastic processes: a first stochastic process generates text fragments that are relevant to a certain query, and a second stochastic process generates text fragments independent of any particular query. The generation of text fragments by the two processes is modeled by means of two Hidden Markov Models (HMMs). Whether one of these two HMMs generates a text fragment with high or low probability depends on the distribution of the query features within the text fragment.

In the case of document retrieval, each document is assigned a score that depends on the ratio of the probability that the document was generated by the first stochastic process to the probability that it was generated by the second. As usual, the documents are presented to the user in decreasing order of their scores. In the case of passage retrieval, the score of a passage depends on the probability that the passage itself was generated by the first stochastic process while the text fragments before and after the passage were generated by the second.

Three problems are considered difficult in Information Retrieval. First, it is not well understood how complex features (e.g. phrases, proximity data, negations, co-occurrence and co-citation data) should be used for indexing. Second, we lack a general weighting scheme for arbitrary indexing features and arbitrary document collections. Third, the optimal segmentation of a long document into segments that are either relevant or irrelevant to a query remains an open problem. Our approach offers promising capabilities to solve these three problems at least partially.
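The document-retrieval scoring rule described above can be sketched as a log-likelihood ratio between two HMMs, each evaluated with the forward algorithm. The one-state toy models and all probability values below are illustrative assumptions, not the paper's trained parameters:

```python
import math

def logsumexp(xs):
    # Numerically stable log of a sum of exponentials.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_loglik(tokens, start, trans, emit, unseen=1e-6):
    """Log P(tokens | HMM), computed with the forward algorithm in log space."""
    states = list(start)
    alpha = {s: math.log(start[s]) + math.log(emit[s].get(tokens[0], unseen))
             for s in states}
    for tok in tokens[1:]:
        alpha = {s: logsumexp([alpha[p] + math.log(trans[p][s]) for p in states])
                    + math.log(emit[s].get(tok, unseen))
                 for s in states}
    return logsumexp(list(alpha.values()))

def llr_score(tokens, relevant_hmm, general_hmm):
    """Document score: log ratio of the two generating-process likelihoods."""
    return forward_loglik(tokens, *relevant_hmm) - forward_loglik(tokens, *general_hmm)

# Toy one-state models: the "relevant" process favors the query term "retrieval".
relevant = ({"r": 1.0}, {"r": {"r": 1.0}},
            {"r": {"retrieval": 0.5, "model": 0.3, "the": 0.2}})
general = ({"g": 1.0}, {"g": {"g": 1.0}},
           {"g": {"retrieval": 0.05, "model": 0.15, "the": 0.8}})

on_topic = ["the", "retrieval", "model"]
off_topic = ["the", "the", "the"]
```

Ranking documents by `llr_score` in decreasing order yields the presentation order described above; with more than one state per model, the relevant process can also capture the positional structure the paper emphasizes.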
First, information about the positions of features can be preserved, because in our approach a document is considered as being produced by a stochastic process. Conventional retrieval models do not take into account the positions where...
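The query-dependent segmentation of a document into relevant and irrelevant passages can be sketched with a two-state HMM, one "relevant" state R and one "irrelevant" state I, decoded with the Viterbi algorithm. All probabilities below are illustrative assumptions, not values from the paper:

```python
import math

def emit_logp(state, tok, query_terms):
    # A relevant passage emits query terms more often (illustrative values).
    p_q = 0.3 if state == "R" else 0.02
    return math.log(p_q if tok in query_terms else 1.0 - p_q)

def segment(tokens, query_terms, p_stay=0.9):
    """Label each token position "R" (relevant) or "I" (irrelevant)
    using the most probable Viterbi state path."""
    states = ("R", "I")
    log_trans = {a: {b: math.log(p_stay if a == b else 1.0 - p_stay)
                     for b in states} for a in states}
    v = {s: math.log(0.5) + emit_logp(s, tokens[0], query_terms) for s in states}
    backptr = []
    for tok in tokens[1:]:
        nv, bp = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[p] + log_trans[p][s])
            nv[s] = v[prev] + log_trans[prev][s] + emit_logp(s, tok, query_terms)
            bp[s] = prev
        v = nv
        backptr.append(bp)
    # Backtrace from the best final state.
    path = [max(states, key=lambda s: v[s])]
    for bp in reversed(backptr):
        path.append(bp[path[-1]])
    return list(reversed(path))
```

Maximal runs of consecutive "R" labels in the returned path correspond to the relevant passages; the `p_stay` parameter controls how strongly the decoder resists fragmenting the document into many short segments.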
This paper presents four novel techniques for open-vocabulary spoken document retrieval: a method to detect slots that possibly contain a query feature; a method to estimate occurrence probabilities; a technique that we call collection-wide probability re-estimation; and a weighting scheme which takes advantage of the fact that long query features are detected more reliably. These four techniques have been evaluated on the TREC-6 spoken document retrieval test collection to determine the improvements in retrieval effectiveness with respect to a baseline retrieval method. Results show that retrieval effectiveness can be improved considerably despite the large number of speech recognition errors.
We present an information retrieval system that allows searching text and speech documents simultaneously. The retrieval system accepts vague queries and performs a best-match search to find those documents that are relevant to the query. The output of the retrieval system is a ranked list of documents in which the documents at the top of the list best satisfy the user's information need. The relevance of the documents is estimated by means of metadata (document description vectors). The metadata is generated automatically and organized such that queries can be processed efficiently. We introduce a controlled indexing vocabulary for both speech and text documents. The size of the new indexing vocabulary is small (1,000 features) compared with the sizes of indexing vocabularies in conventional text retrieval (10,000 to 100,000 features). We show that the retrieval effectiveness based on such a small indexing vocabulary is similar to that of a Boolean retrieval system.
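A best-match search over such automatically generated description vectors can be sketched as follows. The tiny stand-in vocabulary and the cosine measure are illustrative assumptions, not the system's actual vocabulary or similarity function:

```python
import math
from collections import Counter

# Illustrative stand-in for the small controlled indexing vocabulary.
VOCAB = ("speech", "text", "retrieval", "query", "index", "document")

def describe(tokens):
    """Automatically generated metadata: a feature-count vector over VOCAB."""
    counts = Counter(t for t in tokens if t in VOCAB)
    return [counts[f] for f in VOCAB]

def cosine(u, v):
    # Cosine similarity between two description vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_tokens, documents):
    """Best-match search: document indices in decreasing order of score."""
    q = describe(query_tokens)
    scores = [(cosine(q, describe(d)), i) for i, d in enumerate(documents)]
    return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))]
```

Because every description vector has only as many dimensions as the controlled vocabulary has features, the metadata stays compact and the ranking can be computed efficiently, which is the practical point of keeping the vocabulary small.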