2021
DOI: 10.48550/arxiv.2104.08663
Preprint
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Abstract: Neural IR models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their generalization capabilities. To address this, and to allow researchers to more broadly establish the effectiveness of their models, we introduce BEIR (Benchmarking IR), a heterogeneous benchmark for information retrieval. We leverage a careful selection of 17 datasets for evaluation spanning diverse retrieval tasks including open-domain datasets as well as narrow expert domains. We st…

Cited by 53 publications (115 citation statements)
References 32 publications
“…In fact, we already know the answer, at least in part: learned representations often perform terribly in out-of-distribution settings when applied in a zero-shot manner. Evidence comes from the BEIR benchmark [Thakur et al., 2021], which aims to evaluate the effectiveness of dense retrieval models across diverse domains. Results show that directly applying a dense retrieval model trained on one dataset to another dataset sometimes yields effectiveness that is worse than BM25.…”
Section: Discussion
confidence: 99%
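The zero-shot setup this statement describes is simple to reproduce. Below is a minimal sketch following the beir package's published quickstart: a dense retriever trained on MS MARCO is applied unchanged to an out-of-domain BEIR dataset. The dataset (SciFact) and checkpoint name are illustrative choices, not ones taken from any citing paper.

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Illustrative dataset choice: SciFact, one of the BEIR tasks.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Zero-shot: this dense model was trained on MS MARCO, never on SciFact.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # nDCG@10 is the headline metric reported in the BEIR paper
```

Running the same evaluation with a BM25 baseline on the same qrels is what produces the comparisons the quoted statement refers to.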
“…For example, Li et al. [2021] proposed model uncertainty fusion as a solution. The BEIR benchmark [Thakur et al., 2021] provides a resource to evaluate progress, and the latest results show that learned sparse representations are able to outperform BM25 [Formal et al., 2021a]. At a high level, there are at least three intertwined research questions:…”
Section: Discussion
confidence: 99%
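For context on the learned sparse representations this statement mentions: SPLADE-family models (Formal et al., 2021a) turn masked-language-model logits into one non-negative weight per vocabulary term via log-saturated pooling. A minimal PyTorch sketch of that pooling step, under assumed input shapes; the function name and the use of sum (rather than max) pooling are illustrative.

```python
import torch

def splade_pool(mlm_logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # mlm_logits: (batch, seq_len, vocab_size) MLM scores from a BERT-style encoder
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    sat = torch.log1p(torch.relu(mlm_logits))   # log(1 + ReLU(w_ij)) saturation
    sat = sat * attention_mask.unsqueeze(-1)    # zero out padding positions
    return sat.sum(dim=1)                       # one sparse weight per vocabulary term

# Toy shapes only; a real model supplies the logits.
weights = splade_pool(torch.randn(2, 8, 30522), torch.ones(2, 8))
print(weights.shape)  # torch.Size([2, 30522])
```

The resulting |V|-dimensional vectors are mostly zero, so they can be indexed and searched like classical term vectors while still being learned end to end.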
“…A potential solution is to train a dense retriever on a large retrieval dataset such as MS-MARCO, and then apply it to new domains, a setting referred to as zero-shot. Unfortunately, in this setting dense retrievers are often outperformed by classical methods based on term frequency, which do not require supervision (Thakur et al., 2021).…”
Section: Introduction
confidence: 99%
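The "classical methods based on term frequency" in this statement are typified by BM25, which scores documents from corpus statistics alone, with no training data. A self-contained sketch using the rank_bm25 package (an illustrative choice; BEIR itself wraps an Elasticsearch-based BM25), with an invented toy corpus:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Invented toy corpus for illustration only.
corpus = [
    "BEIR evaluates retrieval models across heterogeneous domains",
    "BM25 is a term frequency baseline that needs no training data",
    "dense retrievers are trained on large labelled datasets like MS MARCO",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)  # no supervision: only term and length statistics

query = "unsupervised term frequency baseline".lower().split()
print(bm25.get_scores(query))              # one relevance score per document
print(bm25.get_top_n(query, corpus, n=2))  # top-2 documents for the query
```

Because no labelled data enters this pipeline, BM25 transfers to new domains for free, which is exactly why it remains a hard zero-shot baseline.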
“…Here, neural approaches lead to enormous effectiveness gains over traditional techniques [8,13,20,24]. A valid concern is the generalizability and applicability of the developed techniques to other domains and settings [16,14,31,34].…”
Section: Introduction
confidence: 99%