Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021
DOI: 10.1145/3404835.3462804

MS MARCO: Benchmarking Ranking Models in the Large-Data Regime

Abstract: Evaluation efforts such as TREC, CLEF, NTCIR and FIRE, alongside public leaderboards such as MS MARCO, are intended to encourage research and track our progress, addressing big questions in our field. However, the goal is not simply to identify which run is "best", achieving the top score. The goal is to move the field forward by developing robust new techniques that work in many different settings and are adopted in research and practice. This paper uses the MS MARCO and TREC Deep Learning Track as our case …

Cited by 41 publications (32 citation statements)
References 74 publications (79 reference statements)
“…A recent perspective paper by Craswell et al. [11] provides a complete exposition on the background and status of the MS MARCO project. That paper carefully and thoroughly addresses many common concerns regarding the MS MARCO datasets, including questions of internal validity, robust usefulness, and the reliability of statistical tests.…”
Section: MS MARCO
confidence: 99%
“…In this section, we provide only the background required to fully understand the work reported in the current paper. In particular, Craswell et al. [11] address concerns raised by Ferrante et al. [12], who apply measurement theory to draw attention to important shortcomings of established evaluation measures such as MRR. Many of these measures are not interval-scaled, so many common statistical tests are not permissible, and, strictly speaking, these measures should not even be averaged.…”
Section: MS MARCO
confidence: 99%
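
To make the measurement-theory concern concrete, the following minimal Python sketch (an editorial illustration, not code from either cited paper) computes MRR and shows that equal one-position rank changes produce very different score changes, which is the non-interval behaviour at issue; the per-query ranks used are hypothetical.

def reciprocal_rank(rank_of_first_relevant: int) -> float:
    # RR for one query: 1 / rank of the first relevant result.
    return 1.0 / rank_of_first_relevant

def mean_reciprocal_rank(ranks: list[int]) -> float:
    # MRR: the arithmetic mean of per-query reciprocal ranks.
    return sum(reciprocal_rank(r) for r in ranks) / len(ranks)

# A one-position drop near the top of the ranking costs 0.5 ...
print(reciprocal_rank(1) - reciprocal_rank(2))    # 0.5
# ... while the same one-position drop lower down costs only ~0.011.
print(reciprocal_rank(9) - reciprocal_rank(10))   # 0.0111...

# Hypothetical first-relevant-result ranks for five queries.
print(mean_reciprocal_rank([1, 2, 5, 10, 3]))     # 0.4266...

Because identical rank changes move the score by unequal amounts, averaging reciprocal ranks mixes unlike quantities, which is exactly the objection Ferrante et al. raise.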
“…In more detail, there can be a stack of complex re-rankers after the efficient first-stage retriever. This multi-stage cascaded architecture is very common and practical, both in industry (Yin et al., 2016; Liu et al., 2021d; Li and Xu, 2014) and on academic ranking leaderboards (Craswell et al., 2021). Given the large computational cost of Transformer-based pre-trained models, they are often employed as the last-stage re-ranker, whose goal is to re-rank the small set of documents provided by the previous stage.…”
Section: Pre-training Methods Applied In Re-ranking Component
confidence: 99%
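
As an editorial illustration of the cascaded architecture described in the quote above (not code from any of the cited systems), the self-contained Python sketch below runs a cheap lexical first stage over the whole corpus and applies a pretend-expensive re-ranker only to the surviving candidates; both scoring functions are toy stand-ins.

def lexical_score(query: str, doc: str) -> float:
    # Stage-1 scorer: cheap term-overlap count (a BM25 stand-in).
    q_terms = set(query.lower().split())
    return sum(1.0 for t in doc.lower().split() if t in q_terms)

def rerank_score(query: str, doc: str) -> float:
    # Stage-2 scorer: stand-in for an expensive cross-encoder.
    # Here: term overlap normalized by document length, as a toy example.
    return lexical_score(query, doc) / (1 + len(doc.split()))

def cascade_rank(query: str, corpus: dict[str, str],
                 k_first: int = 100, k_final: int = 10) -> list[str]:
    # Stage 1: recall-oriented candidate generation over the full corpus.
    candidates = sorted(corpus, key=lambda d: lexical_score(query, corpus[d]),
                        reverse=True)[:k_first]
    # Stage 2: precision-oriented re-scoring of the small candidate set,
    # where a Transformer's per-document cost would be affordable.
    return sorted(candidates, key=lambda d: rerank_score(query, corpus[d]),
                  reverse=True)[:k_final]

corpus = {
    "d1": "deep learning for ad hoc document ranking",
    "d2": "benchmarking ranking models on large datasets",
    "d3": "a survey of classical information retrieval",
}
print(cascade_rank("ranking models benchmark", corpus, k_first=3, k_final=2))

The design point is that the expensive scorer touches only k_first documents rather than the full collection, which is what makes Transformer-based re-rankers affordable in the last stage.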
“…-CWP200T, SogouT: CWP200T and SogouT (Luo et al., 2017). -MS MARCO: MS MARCO (Craswell et al., 2021) is a popular large-scale document collection containing about 3.2 million available documents drawn from the Bing search engine. In addition, 1 million non-question queries are included in the dataset for different retrieval tasks.…”
Section: Datasets For Pre-training
confidence: 99%
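
For readers who want to inspect the collection the quoted passage describes, the sketch below shows one way to iterate over MS MARCO, assuming the third-party ir_datasets package (https://ir-datasets.com/); the package and dataset identifiers are an assumption of this note, not part of the cited paper, and the first call triggers a large download.

import ir_datasets

# "msmarco-document" exposes the ~3.2M-document corpus described above;
# each document carries doc_id, url, title and body fields.
dataset = ir_datasets.load("msmarco-document")
for doc in dataset.docs_iter()[:3]:   # first three documents only
    print(doc.doc_id, doc.title[:60])

# The training split pairs the corpus with real Bing queries.
train = ir_datasets.load("msmarco-document/train")
for query in train.queries_iter():
    print(query.query_id, query.text)
    break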