Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM 2020)
DOI: 10.1145/3340531.3412079

Learning to Re-Rank with Contextualized Stopwords

Abstract: The use of stopwords has been thoroughly studied in traditional Information Retrieval systems, but remains unexplored in the context of neural models. Neural re-ranking models take the full text of both the query and the document into account. Removing tokens that carry no relevance information therefore offers an opportunity to improve effectiveness by reducing noise, and to lower the storage requirements of cached document representations. In this work we propose a novel contextualized stopword detection…
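The abstract describes a learned, context-dependent removal gate over document tokens. A minimal sketch of such a gate, assuming a linear scoring layer followed by a ReLU (all names and parameters here are hypothetical, not taken from the paper's implementation):

```python
import numpy as np

def stopword_gate(token_vecs, w, b):
    """Hypothetical contextualized stopword gate: score each token's
    contextual vector, then apply a ReLU so that low-scoring
    (stopword-like) tokens get an exact zero gate and can be dropped
    from the cached document representation."""
    scores = token_vecs @ w + b           # one scalar score per token
    gates = np.maximum(scores, 0.0)       # ReLU: gate == 0 means "removed"
    return token_vecs * gates[:, None], gates

# toy usage: 3 tokens with 4-dim contextual vectors
vecs = np.array([[1., 0., 0., 0.],
                 [0., 1., 0., 0.],
                 [0., 0., 1., 0.]])
w = np.array([1., -1., 1., 0.])           # hypothetical learned weights
gated, gates = stopword_gate(vecs, w, b=0.0)
kept = int((gates > 0).sum())             # tokens that survive the gate
```

Because the ReLU produces exact zeros rather than small values, removed tokens need not be stored at all, which is what yields the caching-storage savings the abstract mentions.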

Cited by 4 publications (7 citation statements)
References 13 publications (12 reference statements)
“…To further reduce the number of passage tokens to store, we adopt a simplified version of Hofstätter et al. [17]'s contextualized stopwords (CS), which was first introduced for the TK-Sparse model. CS learns a removal gate for tokens based solely on their context-dependent vector representations.…”
Section: Simplified Contextualized Stopwords
confidence: 99%
“…The original implementation [17] masks scores after TK's kernel-activation, meaning the non-zero gates have to be saved as well, which increases the system's complexity. In contrast, we directly apply the gate to the representation vectors.…”
Section: Simplified Contextualized Stopwords
confidence: 99%
“…However, allowing each query embedding the same chance to contribute to the candidate set may be sub-optimal. Indeed, consider a query embedding representing a stopword appearing in the query: retrieving many nearest neighbours to that query embedding is unlikely to retrieve as many relevant documents as a more discriminative query embedding [4,21].…”
Section: Rankings From the Approximate First Stage
confidence: 99%
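The idea in this last statement, skipping stopword-like query embeddings during candidate generation so that only discriminative embeddings issue nearest-neighbour lookups, can be sketched as a simple salience filter (threshold and names hypothetical):

```python
import numpy as np

def prune_query_embeddings(query_vecs, salience, threshold=0.1):
    """Hypothetical sketch: drop query embeddings whose salience score
    falls below a threshold, so stopword-like embeddings do not spend
    nearest-neighbour lookups that rarely surface relevant documents."""
    keep = salience >= threshold          # boolean mask over query tokens
    return query_vecs[keep]

# toy usage: 5 query embeddings, 2 of which are stopword-like
q = np.random.rand(5, 8)
sal = np.array([0.9, 0.05, 0.4, 0.02, 0.7])
pruned = prune_query_embeddings(q, sal)   # 3 embeddings survive
```

Only the surviving embeddings would then be sent to the approximate nearest-neighbour index in the first retrieval stage.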