2021
DOI: 10.48550/arxiv.2111.13853
Preprint

Pre-training Methods in Information Retrieval

Abstract: The core of information retrieval (IR) is to identify relevant information from large-scale resources and return it as a ranked list in response to the user's information need. Recently, the resurgence of deep learning has greatly advanced this field and led to a hot topic named NeuIR (i.e., neural information retrieval), especially the paradigm of pre-training methods (PTMs). Owing to sophisticated pre-training objectives and huge model size, pre-trained models can learn universal language representations from m…

Cited by 11 publications (12 citation statements) | References 200 publications (305 reference statements)
“…Thus, based on the text representation type and the corpus index mode, passage retrieval models can be roughly categorized into two main classes. Sparse retrieval models improve retrieval by obtaining semantics-aware sparse representations and indexing them with an inverted index for efficient retrieval; dense retrieval models convert queries and passages into continuous embedding representations and turn to approximate nearest neighbor (ANN) algorithms for fast retrieval [13].…”
Section: Related Work
Mentioning confidence: 99%
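The dense side of this distinction can be made concrete with a short sketch. The snippet below is a hypothetical illustration, not code from the surveyed paper: it brute-forces inner-product search over dense embeddings with NumPy, using a toy hashed-projection `encode()` as a stand-in for a learned encoder; production systems replace the exhaustive scan with an ANN index such as Faiss or HNSW.

```python
import numpy as np

# Toy stand-in for a learned bi-encoder: a random projection over hashed
# tokens, just to keep the example self-contained.
rng = np.random.default_rng(0)
proj = rng.normal(size=(1000, 128))  # hashed token id -> 128-d vector

def encode(text: str) -> np.ndarray:
    """Hypothetical encoder: sum of hashed-token projections, L2-normalized."""
    vec = np.zeros(128)
    for tok in text.lower().split():
        vec += proj[hash(tok) % 1000]
    return vec / (np.linalg.norm(vec) + 1e-9)

passages = [
    "neural models learn dense representations of text",
    "the inverted index supports efficient sparse retrieval",
    "approximate nearest neighbor search speeds up dense retrieval",
]
index = np.stack([encode(p) for p in passages])  # offline: embed the corpus

query = encode("fast dense retrieval with ANN")  # online: embed the query
scores = index @ query                           # inner-product relevance
top = np.argsort(-scores)                        # exhaustive scan here; real
print([passages[i] for i in top[:2]])            # systems use Faiss / HNSW
```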
“…Continued pretraining of off-the-shelf language models has been investigated in monolingual retrieval [5,13,16]. Specifically, coCondenser [16] continues pretraining the language model with a passage-pair classification task (i.e., determining whether two passages belong to the same document) through contrastive learning on the passage representations for monolingual IR, before fine-tuning it as a DPR model.…”
Section: Background and Related Work
Mentioning confidence: 99%
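The same-document contrastive objective that this statement attributes to coCondenser can be sketched as an in-batch InfoNCE loss. This is a hedged reconstruction rather than the authors' code: `emb_a` and `emb_b` are assumed to be encoder outputs for two passages drawn from the same document, and every cross pair in the batch serves as a negative.

```python
import torch
import torch.nn.functional as F

def same_document_contrastive_loss(emb_a: torch.Tensor,
                                   emb_b: torch.Tensor,
                                   temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE over same-document passage pairs (a sketch).

    emb_a[i] and emb_b[i] come from the same document (positive pair);
    all cross pairs (i, j != i) act as in-batch negatives.
    """
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for encoder outputs.
loss = same_document_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```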
“…Dense retrieval has received increasing interest in recent years from both industrial and academic communities due to its benefits to many IR-related tasks, e.g., web search [9,17,26], question answering [20,23,43], and conversational systems [10,39]. Without loss of generality, dense retrieval usually utilizes a Siamese or bi-encoder architecture to encode queries and documents into low-dimensional representations that abstract their semantic information [18,19,21,38,40,41].…”
Section: Introduction
Mentioning confidence: 99%
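As a minimal sketch of the Siamese/bi-encoder setup described in that statement, the snippet below encodes queries and documents with one shared transformer and takes the [CLS] vector as the text representation. The checkpoint name and the CLS-pooling choice are illustrative assumptions, not details prescribed by the cited works.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# One shared encoder ("Siamese") maps both queries and documents into the
# same embedding space; scoring is a simple dot product. Checkpoint and
# pooling are assumptions for illustration only.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt")
    # Use the [CLS] token's final hidden state as the text representation.
    return encoder(**batch).last_hidden_state[:, 0]

q = embed(["what is dense retrieval"])
d = embed(["Dense retrieval encodes text into embeddings.",
           "Sparse retrieval relies on an inverted index."])
scores = q @ d.T  # dot-product relevance between the query and each document
```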