The classical probabilistic models attempt to capture the Ad hoc information retrieval problem within a rigorous probabilistic framework. It has long been recognized that the primary obstacle to effective performance of the probabilistic models is the need to estimate a relevance model. The Dirichlet compound multinomial (DCM) distribution , which relies on hierarchical Bayesian modeling techniques, or the Polya Urn scheme, is a more appropriate generative model than the traditional multinomial distribution for text documents. We explore a new probabilistic model based on the DCM distribution, which enables efficient retrieval and accurate ranking. Because the DCM distribution captures the dependency of repetitive word occurrences, the new probabilistic model is able to model the concavity of the score function more effectively. To avoid the empirical tuning of retrieval parameters, we design several parameter estimation algorithms to automatically set model parameters. Additionally, we propose a pseudo-relevance feedback algorithm based on the latent mixture modeling of the Dirichlet compound multinomial distribution to further improve retrieval accuracy. Finally, our experiments show that both the baseline probabilistic retrieval algorithm based on the DCM distribution and the corresponding pseudo-relevance feedback algorithm outperform the existing language modeling systems on several TREC retrieval tasks.
Relevance feedback, which traditionally uses the terms in the relevant documents to enrich the user's initial query, is an effective method for improving retrieval performance. The traditional relevance feedback algorithms lead to overfitting because of the limited amount of training data and large term space. This paper introduces an online Bayesian logistic regression algorithm to incorporate relevance feedback information. The new approach addresses the overfitting problem by projecting the original feature space onto a more compact set which retains the necessary information.The new set of features consist of the original retrieval score, the distance to the relevant documents and the distance to non-relevant documents. To reduce the human evaluation effort in ascertaining relevance, we introduce a new active learning algorithm based on variance reduction to actively select documents for user evaluation. The new active learning algorithm aims to select feedback documents to reduce the model variance. The variance reduction approach leads to capturing relevance, diversity and uncertainty of the unlabeled documents in a principled manner. These are the critical factors of active learning indicated in previous literature. Experiments with several TREC datasets demonstrate the effectiveness of the proposed approach.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.