A Pólya Urn Document Language Model for Improved Information Retrieval

Cummins, Ronan; Paik, Jiaul H.; Lv, Yuanhua

doi:10.1145/2746231

Cited by 29 publications

(33 citation statements)

References 66 publications


“…Song et al. proposed a customized learning-to-rank algorithm and a query term position-based re-ranking model to improve the retrieval performance [28]. As biomedical articles are usually full-text scientific articles which are much longer than Web documents, Cummins et al. applied the recently proposed SPUD language model [10] to CDS for retrieving long documents in a balanced way [9]. Abacha and Khelifi investigated several query reformulation methods utilizing MeSH and DBpedia.…”

Section: State-of-the-art CDS Methods (mentioning)

confidence: 99%

A Feedback-Based Approach to Utilizing Embeddings for Clinical Decision Support

Yang et al. 2017

Data Sci. Eng.


Clinical Decision Support (CDS) is widely seen as an information retrieval (IR) application in the medical domain. The goal of CDS is to help physicians find useful information from a collection of medical articles with respect to the given patient records, in order to take the best care of their patients. Most existing CDS methods do not sufficiently consider the semantic relation between texts, which leaves room to improve performance in biomedical article retrieval. This paper proposes a novel feedback-based approach which considers the semantic association between a retrieved biomedical article and a pseudo feedback set. Evaluation results show that our method outperforms the strong baselines and is able to improve over the best runs in the TREC CDS tasks.



“…We implement and report a similar term-selection scheme using the Dirichlet-compound-multinomial (PDCM) as a generative model of the top |F| documents as a baseline. As advances in document modelling are likely to yield improvements for principled PRF approaches, we also adopt a recently developed document language model based on the multivariate Pólya distribution [4]. A detailed comparative study [8] into PRF approaches reports that both RM3 and SMM achieve comparable performance but that RM3 has more stable parameter settings (i.e.…”

Section: Related Work (mentioning)

confidence: 99%

“…where $m_d$ is the number of word-types (distinct terms) in $d$, $c(t, d)$ is the count of term $t$ in document $d$, $|d|$ is the number of word tokens in $d$, $df_t$ is the document frequency of term $t$ in the collection $C$, and $m_c$ is a background mass parameter that can be estimated via numerical methods (see [4] for details). The scale parameters $m_d$ and $m_c$ can be interpreted as beliefs in the parameters $c(t, d)/|d|$ and $df_t / \sum_{t'} df_{t'}$ respectively.…”

Section: Smoothed Pólya Urn Document Model (mentioning)

confidence: 99%
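The quantities named in the statement above can be computed directly from a tokenised document and a table of document frequencies. The following is a minimal sketch, not the authors' implementation: the function name `spud_parameters`, the toy document, the toy document frequencies, and the value `m_c = 50.0` are all hypothetical, and the excerpt does not show how the two parameter vectors are combined into a final retrieval score, so that step is omitted.

```python
from collections import Counter


def spud_parameters(doc_tokens, doc_freq, m_c):
    """Scale the document and background estimates by their mass parameters.

    doc_tokens: list of word tokens in document d
    doc_freq:   dict mapping term t to its document frequency df_t
    m_c:        background mass parameter (assumed given here; the quoted
                text says it is estimated numerically)
    """
    counts = Counter(doc_tokens)          # c(t, d)
    doc_len = len(doc_tokens)             # |d|, number of word tokens
    m_d = len(counts)                     # number of distinct terms in d
    total_df = sum(doc_freq.values())     # sum over t' of df_t'

    # m_d expresses belief in c(t, d) / |d|
    doc_params = {t: m_d * c / doc_len for t, c in counts.items()}
    # m_c expresses belief in df_t / sum_t' df_t'
    bg_params = {t: m_c * df / total_df for t, df in doc_freq.items()}
    return m_d, doc_params, bg_params


# Hypothetical toy data, for illustration only.
tokens = ["urn", "urn", "model", "ball"]
dfs = {"urn": 10, "model": 30, "ball": 60}
m_d, doc_params, bg_params = spud_parameters(tokens, dfs, m_c=50.0)
print(m_d, doc_params, bg_params)
```

Each parameter vector sums to its own scale (`m_d` or `m_c`), which is one way to read the statement that the scales act as beliefs in the underlying proportions.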

“…We show that this new approach outperforms the original relevance modelling approach to query expansion and also adheres to a number of recently proposed constraints [3] regarding the term-selection function for PRF. Furthermore, we adopt a recently developed document language model [4] that assumes that documents are generated from a mixture of multivariate Pólya distributions (a.k.a. the Dirichlet-compound-multinomial).…”

Section: Introduction (mentioning)

confidence: 99%


Improved Query-Topic Models Using Pseudo-Relevant Pólya Document Models

Cummins

2017

Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval

Self Cite


Query-expansion via pseudo-relevance feedback is a popular method of overcoming the problem of vocabulary mismatch and of increasing average retrieval effectiveness. In this paper, we develop a new method that estimates a query topic model from a set of pseudo-relevant documents using a new language modelling framework. We assume that documents are generated via a mixture of multivariate Pólya distributions, and we show that by identifying the topical terms in each document, we can appropriately select terms that are likely to belong to the query topic model. The results of experiments on several TREC collections show that the new approach compares favourably to current state-of-the-art expansion methods.


“…This phenomenon is known as word burstiness [18] and is a type of dependency that is not modelled in the multinomial language model. Cummins et al. [8] present a Smoothed Pólya Urn Document language model, which incorporates word burstiness only into the document model. They use the Dirichlet compound multinomial (DCM) to model documents in place of the standard multinomial distribution, whereas the standard multinomial is used to model query generation.…”

Section: Language Model (mentioning)

confidence: 99%
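The contrast drawn above between the multinomial and the DCM can be illustrated numerically. The sketch below uses the standard closed-form log-probabilities of both distributions; the four-term vocabulary, the count vectors, and the Dirichlet parameters are illustrative assumptions and are not taken from any of the cited papers.

```python
from math import lgamma, log


def log_multinomial(counts, probs):
    """Log-probability of a count vector under a multinomial."""
    n = sum(counts)
    out = lgamma(n + 1)
    for x, p in zip(counts, probs):
        out += x * log(p) - lgamma(x + 1)
    return out


def log_dcm(counts, alphas):
    """Log-probability of a count vector under the Dirichlet compound
    multinomial (multivariate Pólya) distribution."""
    n = sum(counts)
    a = sum(alphas)
    out = lgamma(n + 1) + lgamma(a) - lgamma(a + n)
    for x, al in zip(counts, alphas):
        out += lgamma(x + al) - lgamma(al) - lgamma(x + 1)
    return out


# Assumed toy setting: four equally likely terms, four tokens drawn.
probs = [0.25, 0.25, 0.25, 0.25]
alphas = [0.25, 0.25, 0.25, 0.25]   # same mean as probs, small total mass
bursty = [4, 0, 0, 0]               # one term repeated four times
spread = [1, 1, 1, 1]               # each term used once

for name, counts in [("bursty", bursty), ("spread", spread)]:
    print(name, log_multinomial(counts, probs), log_dcm(counts, alphas))
```

With these assumed values the multinomial prefers the evenly spread document, while the DCM assigns the higher probability to the document that repeats a single term, which is the burstiness effect the statement describes.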

A Probabilistic Model for Information Retrieval Based on Maximum Value Distribution

Paik

2015

Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval

Self Cite


The main goal of a retrieval model is to measure the degree of relevance of a document with respect to the given query. Probabilistic models are widely used to measure the likelihood of relevance of a document by combining within-document term frequency and term specificity in a formal way. Recent research shows that tf normalization that factors in multiple aspects of term salience is an effective scheme. However, existing models do not fully utilize these tf normalization components in a principled way. Moreover, most state-of-the-art models ignore the distribution of a term in the part of the collection that contains the term. In this article, we introduce a new probabilistic model of ranking that addresses the above issues. We argue that, since the relevance of a document increases with the frequency of the query term, this assumption can be used to measure the likelihood that the normalized frequency of a term in a particular document will be maximum with respect to its distribution in the elite set. Thus, the weight of a term in a document is proportional to the probability that the normalized frequency of that term is maximum under the hypothesis that the frequencies are generated randomly. To that end, we introduce a ranking function based on maximum value distribution that uses two aspects of tf normalization. The merit of the proposed model is demonstrated on a number of recent large web collections. Results show that the proposed model outperforms the state-of-the-art models by a significantly large margin.
