2015
DOI: 10.1145/2746231

A Pólya Urn Document Language Model for Improved Information Retrieval

Abstract: We introduce a generalised multivariate Pólya process for document language modelling. The framework outlined here generalises a number of statistical language models used in information retrieval for modelling document generation. In particular, we show that the choice of replacement matrix M ultimately defines the type of random process and therefore defines a particular type of document language model. We show that a particular variant of the general model is useful for modelling term-specific burstiness. Fu…

citations
Cited by 29 publications
(33 citation statements)
references
References 66 publications
“…Song et al proposed a customized learning-to-rank algorithm and a query term position-based re-ranking model to improve the retrieval performance [28]. As biomedical articles are usually full-text scientific articles which are much longer than Web documents, Cummins et al applied the recently proposed SPUD language model [10] to CDS for retrieving long documents in a balanced way [9]. Abacha and Khelifi investigated several query reformulation methods utilizing MeSH and DBpedia.…”
Section: State-of-the-art CDS Methods (mentioning)
confidence: 99%
“…We implement and report a similar term-selection scheme using the Dirichlet-compound-multinomial (PDCM) as a generative model of the top |F| documents as a baseline. As advances in document modelling are likely to yield improvements for principled PRF approaches, we also adopt a recently developed document language model based on the multivariate Pólya distribution [4]. A detailed comparative study [8] into PRF approaches reports that both RM3 and SMM achieve comparable performance but that RM3 has more stable parameter settings (i.e.…”
Section: Related Work (mentioning)
confidence: 99%
“…where $m_d$ is the number of word-types (distinct terms) in $d$, $c(t, d)$ is the count of term $t$ in document $d$, $|d|$ is the number of word tokens in $d$, $df_t$ is the document frequency of term $t$ in the collection $C$, and $m_c$ is a background mass parameter that can be estimated via numerical methods (see [4] for details). The scale parameters $m_d$ and $m_c$ can be interpreted as beliefs in the parameters $c(t, d)/|d|$ and $df_t / \sum_{t'} df_{t'}$ respectively.…”
Section: Smoothed Pólya Urn Document Model (mentioning)
confidence: 99%
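
The quantities named in this citation statement can be combined into a smoothed term probability. The sketch below is not taken from the cited papers: it assumes the document mass $m_d \cdot c(t, d)/|d|$ and the background mass $m_c \cdot df_t / \sum_{t'} df_{t'}$ are simply added and normalised by $m_d + m_c$. The function name, argument names, and toy data are hypothetical, and $m_c$ is supplied by hand rather than estimated numerically.

```python
from collections import Counter


def smoothed_term_probability(term, doc_tokens, doc_freq, m_c):
    """Illustrative smoothed probability of `term` given a document.

    Assumption (not from the source): document and background
    pseudo-counts are added and normalised by their total mass.
    """
    if not doc_tokens:
        raise ValueError("document must contain at least one token")
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)          # |d|: number of word tokens
    m_d = len(counts)                  # m_d: number of distinct terms in d
    total_df = sum(doc_freq.values())  # sum of df_t' over the collection vocabulary
    doc_mass = m_d * counts[term] / doc_len
    background_mass = m_c * doc_freq.get(term, 0) / total_df
    return (doc_mass + background_mass) / (m_d + m_c)


# Toy data: a three-token document and hypothetical document frequencies.
doc = ["urn", "urn", "model"]
df = {"urn": 2, "model": 1, "query": 1}
probs = {t: smoothed_term_probability(t, doc, df, m_c=2.0) for t in df}
```

Under this assumption the probabilities over the collection vocabulary sum to one, and a term absent from the document ("query" here) still receives background mass.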
“…This phenomenon is known as word burstiness [18] and is a type of dependency that is not modelled in the multinomial language model. Cummins et al [8] present a Smoothed Pólya Urn Document language model, which incorporates word burstiness only into the document model. They use the Dirichlet compound multinomial (DCM) to model documents in place of the standard multinomial distribution, whereas the standard multinomial is used to model query generation.…”
Section: Language Model (mentioning)
confidence: 99%
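
The burstiness described in this statement can be illustrated with a simple Pólya urn simulation. This is a minimal sketch, not the authors' implementation: each drawn term is returned to the urn together with extra weight for that same term, which is the urn scheme that gives rise to the DCM. The function name, the `reinforcement` parameter, and the toy vocabulary are hypothetical; setting `reinforcement=0.0` reduces the process to ordinary multinomial sampling.

```python
import random
from collections import Counter


def polya_urn_sample(initial_weights, n_draws, reinforcement=1.0, seed=0):
    """Draw n_draws terms from an urn, reinforcing each drawn term."""
    rng = random.Random(seed)
    weights = dict(initial_weights)  # copy so the caller's urn is untouched
    drawn = Counter()
    for _ in range(n_draws):
        terms = list(weights)
        # Pick a term with probability proportional to its current weight.
        term = rng.choices(terms, weights=[weights[t] for t in terms], k=1)[0]
        drawn[term] += 1
        # Return the term plus extra mass, so repeats become more likely.
        weights[term] += reinforcement
    return drawn


urn = {"polya": 1.0, "urn": 1.0, "retrieval": 1.0}
bursty = polya_urn_sample(urn, n_draws=50, reinforcement=1.0)
plain = polya_urn_sample(urn, n_draws=50, reinforcement=0.0)
```

With reinforcement, terms that happen to appear early tend to dominate the sample, whereas the unreinforced draws stay closer to the initial proportions on average.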