Simple BM25 extension to multiple weighted fields

Robertson, Stephen; Zaragoza, Hugo; Taylor, Michael

doi:10.1145/1031171.1031181

Cited by 522 publications

(381 citation statements)

References 10 publications

“…Early work treated each field as a smaller document and simply combined field-level scores using linear combination or a mixture of probability models [16]. This straightforward combination of field-level scores was found to have limitations, resulting in efforts such as BM25F [17]. Recently, an adaptation of score combination and smoothing method was suggested [23] for the language modeling approach to IR, based on the search engine Indri [15] which supports combining evidence from multiple fields.…”

Section: Related Work (mentioning)

confidence: 99%

Retrieval experiments using pseudo-desktop collections

Kim

Croft

2009

Proceedings of the 18th ACM Conference on Information and Knowledge Management

Desktop search is an important part of personal information management (PIM). However, research in this area has been limited by the lack of shareable test collections, making cumulative progress difficult. In this paper, we define desktop search as a semi-structured document retrieval problem and introduce a methodology to automatically build a reusable collection (the pseudo-desktop) that has many of the same properties as a real desktop collection. We then present a comprehensive evaluation of retrieval methods for semi-structured document retrieval on several pseudo-desktop collections and the TREC Enterprise collection. Our results show that a probabilistic retrieval model using the mapping relation between a query term and a document field (PRM-S) has the best performance in collections with more structure, such as email, and that the query-likelihood language model is better for other document types. We further analyze the observed differences using generated queries and suggest ways to improve PRM-S, which makes the performance gains more significant and consistent.

“…BM25F [17] is the modification of the BM25 model where field-level evidence is combined at the raw frequency level rather than score level. This maintains non-linear saturation of term frequencies.…”

Section: BM25F (mentioning)

confidence: 99%

Retrieval experiments using pseudo-desktop collections

Kim

Croft

2009

Proceedings of the 18th ACM Conference on Information and Knowledge Management
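
The contrast drawn in the statement above, combining field evidence at the raw frequency level rather than at the score level, can be illustrated with a minimal sketch. It is not the cited authors' implementation: the saturation form tf / (k1 + tf) omits length normalisation, and the value of `K1`, the field weights, the `idf` value, and all function names are assumptions chosen only for illustration.

```python
K1 = 1.2  # assumed saturation parameter, not a value taken from the source


def saturate(tf, k1=K1):
    """BM25-style term-frequency saturation (length normalisation omitted)."""
    return tf / (k1 + tf)


def score_level(field_tfs, weights, idf):
    """Score each field on its own, then combine the field scores linearly."""
    return sum(weights[f] * idf * saturate(tf) for f, tf in field_tfs.items())


def frequency_level(field_tfs, weights, idf):
    """BM25F-style: sum the weighted raw frequencies, then saturate once."""
    combined_tf = sum(weights[f] * tf for f, tf in field_tfs.items())
    return idf * saturate(combined_tf)


# Hypothetical field weights and a single query term with idf = 1.
weights = {"title": 3.0, "body": 1.0}
idf = 1.0
doc = {"title": 1, "body": 1}  # the term occurs once in each field

score_sum = score_level(doc, weights, idf)
bm25f_like = frequency_level(doc, weights, idf)
print(score_sum, bm25f_like)
```

Under these assumptions the frequency-level score can never exceed `idf`, because saturation is applied once to the combined frequency. The score-level sum grows with the field weights and can exceed that bound, which is one way to read the remark that BM25F maintains non-linear saturation.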

“…using the term frequency in specific fields of structured documents (e.g. title, abstract) [11], or integrating query-independent evidence in the retrieval model in the form of prior probabilities for a document [3,6] ('prior' because they are known before the query). In short, when determining the relevance between a query and a document, most IR models use primarily query-dependent term statistics, and sometimes also add query-independent evidence to further enhance retrieval performance.…”

Section: Introduction (mentioning)

confidence: 99%

Probabilistic Document Length Priors for Language Models

Blanco

Barreiro

Lecture Notes in Computer Science

Abstract. This paper addresses the issue of devising a new document prior for the language modeling (LM) approach for Information Retrieval. The prior is based on term statistics, derived in a probabilistic fashion and portrays a novel way of considering document length. Furthermore, we developed a new way of combining document length priors with the query likelihood estimation based on the risk of accepting the latter as a score. This prior has been combined with a document retrieval language model that uses Jelinek-Mercer (JM), a smoothing technique which does not take into account document length. The combination of the prior boosts the retrieval performance, so that it outperforms a LM with a document length dependent smoothing component (Dirichlet prior) and other state of the art high-performing scoring function (BM25). Improvements are significant, robust across different collections and query sizes.

“…Recently, Robertson et al and Zaragoza et al proposed the per-field normalisation technique, which normalises term frequency on a per-field basis [14,18], by extending BM25's normalisation method [13]. The resulting field-based weighting model is called BM25F.…”

Section: Introduction (mentioning)

confidence: 99%

“…Using BM25F, the retrieval process is performed on indices of different document fields, such as body, title, and anchor text of incoming links. Following [14,18], Macdonald et al extended the PL2 DFR weighting model, by employing the per-field normalisation 2F [10]. Compared with tf normalisation on a single field, on one hand, per-field normalisation can significantly boost the retrieval performance, particularly for Web search [12,18].…”

Section: Introduction (mentioning)

confidence: 99%

Setting Per-field Normalisation Hyper-parameters for the Named-Page Finding Search Task

Ounis

Lecture Notes in Computer Science

Abstract. Per-field normalisation has been shown to be effective for Web search tasks, e.g. named-page finding. However, per-field normalisation also suffers from having hyper-parameters to tune on a per-field basis. In this paper, we argue that the purpose of per-field normalisation is to adjust the linear relationship between field length and term frequency. We experiment with standard Web test collections, using three document fields, namely the body of the document, its title, and the anchor text of its incoming links. From our experiments, we find that across different collections, the linear correlation values, given by the optimised hyper-parameter settings, are proportional to the maximum negative linear correlation. Based on this observation, we devise an automatic method for setting the per-field normalisation hyper-parameter values without the use of relevance assessment for tuning. According to the evaluation results, this method is shown to be effective for the body and title fields. In addition, the difficulty in setting the per-field normalisation hyper-parameter for the anchor text field is explained.
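
Per-field normalisation, as discussed in the statements and abstract above, can be sketched as follows. This is an illustrative reading of the commonly described BM25F form, in which each field has its own length-normalisation hyper-parameter b; it is not the tuning method proposed in the abstract. The function names, field lengths, weights, and hyper-parameter values are all hypothetical.

```python
def normalise_tf(tf, field_len, avg_field_len, b):
    """Length-normalise a term frequency within one field.

    b = 0 disables normalisation; b = 1 normalises fully by relative length.
    """
    return tf / ((1.0 - b) + b * field_len / avg_field_len)


def bm25f_term_score(doc_fields, avg_lens, weights, b_params, idf, k1=1.2):
    """Score one query term: normalise per field, combine, saturate once."""
    combined = 0.0
    for field, (tf, length) in doc_fields.items():
        tfn = normalise_tf(tf, length, avg_lens[field], b_params[field])
        combined += weights[field] * tfn
    return idf * combined / (k1 + combined)


# Hypothetical document: (term frequency, field length) for each field.
doc_fields = {"body": (3, 400), "title": (1, 5), "anchor": (2, 12)}
avg_lens = {"body": 300.0, "title": 6.0, "anchor": 10.0}
weights = {"body": 1.0, "title": 3.0, "anchor": 2.0}
b_params = {"body": 0.75, "title": 0.5, "anchor": 0.6}  # one b per field

term_score = bm25f_term_score(doc_fields, avg_lens, weights, b_params, idf=1.5)
print(term_score)
```

The sketch makes the tuning burden mentioned in the abstract visible: every field adds its own entry to `b_params`, and each entry has to be set separately.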
