2020
DOI: 10.1002/asi.24431
Term position‐based language model for information retrieval

Abstract: The term position feature is widely and successfully used in IR and Web search engines to enhance retrieval effectiveness. This feature is essentially used for two purposes: to capture query term proximity, or to boost the weight of terms appearing in some parts of a document. In this paper, we are interested in this second category. We propose two novel query‐independent techniques based on absolute term positions in a document, whose goal is to boost the weight of terms appearing at the beginning of a docum…
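The abstract describes boosting the weight of terms that occur near the beginning of a document. As a rough illustration only (the function name, the exponential decay form, and the `alpha` parameter are assumptions for this sketch, not the paper's actual model), position-dependent weighting can look like this:

```python
import math

def position_boosted_weight(positions, doc_len, alpha=1.0):
    """Hypothetical sketch: each occurrence of a term contributes a weight
    that decays with its absolute position, so occurrences near the start
    of the document count more. `alpha` controls the decay speed and is an
    assumed parameter, not taken from the paper."""
    # an occurrence at position p (0-based) contributes exp(-alpha * p / doc_len)
    return sum(math.exp(-alpha * p / doc_len) for p in positions)

# the same term frequency (2), but at different absolute positions
# in a 100-term document:
w_early = position_boosted_weight([0, 10], 100)   # near the beginning
w_late = position_boosted_weight([80, 90], 100)   # near the end
assert w_early > w_late  # earlier occurrences receive a higher weight
```

Under any monotonically decreasing position function, two documents with identical term frequencies can thus receive different scores depending on where the query terms appear.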

Cited by 7 publications (5 citation statements)
References 31 publications (63 reference statements)
“…Therefore, it is necessary to evaluate the effect of going beyond such assumptions by experiments. Hence, possible topics of future study include combining PBR models with term association methods, such as cross terms (Zhao et al, 2014) or MRF techniques (Metzler & Croft, 2005) that incorporate various degrees of term dependencies, and the use of term position features (Hammache & Boughanem, 2021). Furthermore, we have ignored query-independent features (by Assumption 3).…”
Section: Discussion
confidence: 99%
“…Apart from these models, some approaches that go beyond bag-of-words have been shown to be effective. These include the cross term approach that models term association (Zhao et al, 2014), term proximity matching methods such as Markov random field (MRF) (Metzler & Croft, 2005), and also the use of term position features (Hammache & Boughanem, 2021). As the objective of this paper is to demonstrate the effectiveness of the new bag-of-words PBR models in a pilot study, we restrict to bag-of-words baselines for a fair comparison.…”
Section: Other Methods
confidence: 99%
“…As the documents in the MLIA corpus are long, we only consider the first N sentences for inference, where N denotes the average number of sentences in documents of the corpus. Moreover, previous research [ 29 , 30 ] shows that any relevant document is likely to contain relevant sentences at the beginning of the document. The document-level relevance score is determined by aggregating the top k scoring sentences in the document, where BiScore(d) is the document-level relevance score for document d using the bi-encoder model.…”
Section: Multistage Bicross Encoder
confidence: 99%
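The aggregation described in the quotation above (truncate to the first N sentences, then combine the top-k sentence scores) can be sketched as follows. The function name and the use of a simple sum over the top-k scores are assumptions for illustration, not the cited authors' exact formulation:

```python
def doc_score_topk(sentence_scores, k=3, n=None):
    """Illustrative sketch (names are hypothetical, not the authors' API):
    keep only the first n sentence-level relevance scores, then aggregate
    the k highest of them into a document-level score. Summing the top-k
    is one plausible aggregation; the original paper may use another."""
    if n is not None:
        # only the first n sentences of the document are considered,
        # since relevant content tends to appear early
        sentence_scores = sentence_scores[:n]
    # aggregate the top-k scoring sentences
    return sum(sorted(sentence_scores, reverse=True)[:k])

# example: six sentence scores from a (hypothetical) bi-encoder
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.05]
doc_score_topk(scores, k=2)  # sums the two highest sentence scores
```

Restricting to the first N sentences keeps inference cost bounded on long documents, while the top-k aggregation makes the document score robust to many low-scoring sentences.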
“…For example, the challenge of vocabulary mismatch, and hence the importance of semantic matching, may be amplified when retrieving shorter text [95][96][97]. Similarly, when matching the query against longer text, it is informative to consider the positions of the matches [98][99][100], but may be less so in the case of short text matching. When specifically dealing with long text, the compute and memory requirements may be significantly higher for machine learned systems (e.g., [101]) and require careful design choices for mitigation.…”
Section: Robustness To Variable Length Text
confidence: 99%