AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation

Meng, Rui; Liu, Ye; Yavuz, Semih; Agarwal, Divyansh; Tu, Lifu; Yu, Na; Zhang, Jianguo; Bhat, Meghana; Zhou, Yingbo

doi:10.48550/arxiv.2212.08841

Cited by 3 publications

(3 citation statements)

References 0 publications


“…The latter generates quality but more expensive human-like queries using large language models for DR pre-training (Oguz et al, 2022) or domain adaptation (the third section of Table 1). Concurrently to our work, Meng et al (2023) explore various approaches to query augmentation, such as span selection and document summarization.…”

Section: A Unified Framework Of Improved Dense (mentioning)

confidence: 99%

How to Train Your Dragon: Diverse Augmentation Towards Generalizable Dense Retrieval

Lin, Asai, et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023


Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue is due to limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine the contrastive learning of DRs under the framework of Data Augmentation (DA). Our study shows that common DA practices, such as query augmentation with generative models and pseudo-relevance label creation using a cross-encoder, are often inefficient and suboptimal. We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our Dense Retriever trained with diverse AuGmentatiON, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations and even competes with models using more complex late interaction.


“…This has included InPars (Bonifacio et al, 2022; Jeronymo et al, 2023) and Promptagator (Dai et al, 2022), the latter showcasing significant success on the BEIR benchmark. Augtriever (Meng et al, 2022) introduced methods for synthetic query generation using smaller models, optimizing both time and cost. Peng et al (2023) used soft prompt-tuning to further enhance the quality of generated queries.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recent research exploits Large Language Models (LLMs) to generate synthetic data pairs, constructing synthetic queries from real passages, often derived from zero-shot or few-shot examples (Bonifacio et al, 2022; Jeronymo et al, 2023; Meng et al, 2022; Penha et al, 2023). Addressing the challenges of complex query information retrieval (IR) tasks through LLM-based synthetic data generation presents distinct difficulties.…”

Section: Introduction (mentioning)

confidence: 99%

Length Adaptive Regularization for Retrieval-based Chatbot Models

Wang, Fang 2020

Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval


Effective information retrieval (IR) in settings with limited training data, particularly for complex queries, remains a challenging task. This paper introduces IR2, Information Regularization for Information Retrieval, a technique for reducing overfitting during synthetic data generation. This approach, representing a novel application of regularization techniques in synthetic data creation for IR, is tested on three recent IR tasks characterized by complex queries: DORIS-MAE, ArguAna, and WhatsThatBook. Experimental results indicate that our regularization techniques not only outperform previous synthetic query generation methods on the tasks considered but also reduce cost by up to 50%. Furthermore, this paper categorizes and explores three regularization methods at different stages of the query synthesis pipeline (input, prompt, and output), each offering varying degrees of performance improvement compared to models where no regularization is applied. This provides a systematic approach for optimizing synthetic data generation in data-limited, complex-query IR scenarios. All code, prompts and synthetic data are available at
