Two-Stage Document Length Normalization for Information Retrieval

Na, Seung-Hoon

doi:10.1145/2699669

Cited by 12 publications

(7 citation statements)

References 57 publications


“…Interestingly, recent research has developed a two-stage document length normalisation framework [Na 2015] which incorporates both verbosity and scope normalisation into retrieval methods. It is appealing that the SPUD retrieval methods derived from our probabilistic framework contain these aspects of normalisation naturally.…”

Section: Theoretical Discussion (mentioning)

confidence: 99%

A Pólya Urn Document Language Model for Improved Information Retrieval

Cummins

Paik

2015

ACM Trans. Inf. Syst.


We introduce a generalised multivariate Pólya process for document language modelling. The framework outlined here generalises a number of statistical language models used in information retrieval for modelling document generation. In particular, we show that the choice of replacement matrix M ultimately defines the type of random process and therefore defines a particular type of document language model. We show that a particular variant of the general model is useful for modelling term-specific burstiness. Furthermore, via experimentation we show that this variant significantly improves retrieval effectiveness on a number of small test collections.


“…Fang et al [14] also proposed the use of perturbed document collections to gather further insights on retrieval functions fulfilling the same set of axioms. This approach has not been followed-up upon in works other than [31].…”

Section: Related Work (mentioning)

confidence: 99%

An Axiomatic Approach to Diagnosing Neural IR Models

Rennings

Moraes

Hauff

2019

Lecture Notes in Computer Science


Traditional retrieval models such as BM25 or language models have been engineered based on search heuristics that later have been formalized into axioms. The axiomatic approach to information retrieval (IR) has shown that the effectiveness of a retrieval method is connected to its fulfillment of axioms. This approach enabled researchers to identify shortcomings in existing approaches and "fix" them. With the new wave of neural net based approaches to IR, a theoretical analysis of those retrieval models is no longer feasible, as they potentially contain millions of parameters. In this paper, we propose a pipeline to create diagnostic datasets for IR, each engineered to fulfill one axiom. We execute our pipeline on the recently released large-scale question answering dataset WikiPassageQA (which contains over 4000 topics) and create diagnostic datasets for four axioms. We empirically validate to what extent well-known deep IR models are able to realize the axiomatic pattern underlying the datasets. Our evaluation shows that there is indeed a positive relation between the performance of neural approaches on diagnostic datasets and their retrieval effectiveness. Based on these findings, we argue that diagnostic datasets grounded in axioms are a good approach to diagnosing neural IR models.


“…where k, k′, and b are constants and Δ is the average document length. The TF component of the BM25 ranking function incorporates document length normalization, which ensures long documents are not excessively favored over short documents in retrieval (Na, 2015; Singhal et al., 1996). Instead of a simple normalization by the document length |d|, the normalization in BM25 takes into account that the length of a document may depend on the document's verbosity and scope (Robertson & Walker, 1994).…”

Section: Related Work (mentioning)

confidence: 99%

“…A document d may be represented as a vector with each dimension and its value corresponding to a term t in d and the TF, respectively. The use of TF normalized by the document length in ranking functions can enhance retrieval effectiveness (Na, 2015; Singhal et al., 1996). We denote the normalized TF by f(t, d) and represent the document with normalized TF, d, by a set of tuples, that is,…”

Section: Theoretical Foundation (mentioning)

confidence: 99%

A retrieval model family based on the probability ranking principle for ad hoc retrieval

Dang

Luk

Allan

2022

Asso for Info Science & Tech


Many successful retrieval models are derived based on or conform to the probability ranking principle (PRP). We present a new derivation of a document ranking function given by the probability of relevance of a document, conforming to the PRP. Our formulation yields a family of retrieval models, called probabilistic binary relevance (PBR) models, with various instantiations obtained by different probability estimations. By extensive experiments on a range of TREC collections, improvement of the PBR models over some established baselines with statistical significance is observed, especially in the large Clueweb09 Cat-B collection.
