Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)
DOI: 10.18653/v1/2020.emnlp-main.393

Utility is in the Eye of the User: A Critique of NLP Leaderboards

Abstract: Benchmarks such as GLUE have helped drive advances in NLP by incentivizing the creation of more accurate models. While this leaderboard paradigm has been remarkably successful, a historical focus on performance-based evaluation has been at the expense of other qualities that the NLP community values in models, such as compactness, fairness, and energy efficiency. In this opinion paper, we study the divergence between what is incentivized by leaderboards and what is useful in practice through the lens of microeconomics.
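To illustrate the microeconomic framing, consider a user whose utility depends on more than accuracy. The sketch below is our illustration, not code or data from the paper; the model names, metrics, and weights are invented:

```python
# Hypothetical worked example: a linear utility over model properties.
# Models, numbers, and weights are all invented for illustration.

def utility(m, w):
    """Accuracy adds utility; latency and model size are costs."""
    return w["acc"] * m["acc"] - w["lat"] * m["lat_ms"] - w["size"] * m["size_gb"]

models = {
    "big-sota":   {"acc": 0.92, "lat_ms": 300.0, "size_gb": 10.0},
    "small-fast": {"acc": 0.90, "lat_ms": 15.0,  "size_gb": 0.5},
}

# A deployment-minded user who pays for latency and memory.
w = {"acc": 100.0, "lat": 0.05, "size": 1.0}

for name, m in models.items():
    print(name, round(utility(m, w), 2))
# big-sota:   92 - 15   - 10  = 67.0
# small-fast: 90 - 0.75 - 0.5 = 88.75
```

Under this (assumed) weighting, the leaderboard winner delivers less utility than the runner-up, which is exactly the divergence the paper studies.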

Cited by 106 publications (105 citation statements); references 53 publications.
“…While crowdsourcing has been a boon for large-scale NLP dataset creation (Snow et al., 2008; Munro et al., 2010), we ultimately want NLP systems to handle "natural" data (Kwiatkowski et al., 2019) and be "ecologically valid" (de Vries et al., 2020). Ethayarajh and Jurafsky (2020) analyze the distinction between what leaderboards incentivize and "what is useful in practice" through the lens of microeconomics. A natural setting for exploring these ideas might be dialogue (Hancock et al., 2019; Shuster et al., 2020).…”
Section: Other Related Work
confidence: 99%
“…We would be able to capture not only accuracy, for example, but also usage of computational resources, inference time, fairness, and many other relevant dimensions. This will in turn enable dynamic leaderboards, for example based on utility (Ethayarajh and Jurafsky, 2020). This would also allow for backward-compatible comparisons, remove the worry of the benchmark changing, and automatically put new state-of-the-art models in the loop, addressing some of the main objections.…”
Section: Live Model Evaluation
confidence: 99%
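To make the "dynamic leaderboard" idea concrete, here is a minimal sketch (our illustration; entries and weights are hypothetical) in which the stored per-dimension results stay fixed while the ranking is recomputed from each user's weights:

```python
# Minimal sketch of a utility-ranked dynamic leaderboard.
# Entries and weight choices are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    accuracy: float      # task accuracy in [0, 1]
    latency_ms: float    # mean inference latency
    energy_kwh: float    # energy per 1k inferences

def rank(entries, weights):
    """Sort entries by a linear utility; weights encode one user's preferences."""
    def u(e):
        return (weights["accuracy"] * e.accuracy
                - weights["latency"] * e.latency_ms
                - weights["energy"] * e.energy_kwh)
    return sorted(entries, key=u, reverse=True)

board = [
    Entry("giant-lm", 0.93, 450.0, 2.0),
    Entry("distilled", 0.91, 30.0, 0.1),
]

# The same stored results yield different rankings for different users.
accuracy_only = rank(board, {"accuracy": 1.0, "latency": 0.0, "energy": 0.0})
cost_aware    = rank(board, {"accuracy": 10.0, "latency": 0.01, "energy": 1.0})
print([e.name for e in accuracy_only])  # ['giant-lm', 'distilled']
print([e.name for e in cost_aware])     # ['distilled', 'giant-lm']
```

Because each submission stores its raw measurements rather than a single score, new models can be added and old ones re-ranked without re-running anything, which is what makes the comparisons backward-compatible.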
“…Dodge et al (2019) lay out a set of best practices for results reporting, with a focus on the impact of hyperparameter tuning on model comparison. Ethayarajh and Jurafsky (2020) advocate for the inclusion of efficiency considerations in leaderboard design. Boyd-Graber and Börschinger (2020) describe ways that trivia competitions can provide a model for carefully-considered dataset design.…”
Section: Related Work
confidence: 99%
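For instance, one practice from Dodge et al. (2019) is to report expected best validation performance as a function of the hyperparameter-tuning budget. A minimal sketch, assuming trials are drawn uniformly with replacement from a set of observed random-search scores (the scores below are made up):

```python
# Sketch of expected-best-validation-score reporting, in the spirit of
# Dodge et al. (2019). The trial scores below are made-up numbers.

def expected_max(scores, n):
    """E[max of n draws] when sampling with replacement from `scores`.

    Uses the empirical CDF: P(max_n <= s_(i)) = ((i + 1) / N) ** n
    for the i-th smallest score among N sorted scores.
    """
    s = sorted(scores)
    N = len(s)
    return sum(
        s[i] * (((i + 1) / N) ** n - (i / N) ** n)
        for i in range(N)
    )

trial_scores = [0.71, 0.74, 0.68, 0.80, 0.77, 0.73]  # hypothetical random search
for budget in (1, 3, 6):
    print(budget, round(expected_max(trial_scores, budget), 3))
# The expected best score grows with the budget, so model comparisons
# should report the tuning budget alongside the headline number.
```

Plotting this curve for each model, rather than a single best score, keeps comparisons fair between models tuned with very different budgets.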
“…Thus, we must build systems that have the ability to instantaneously flag data abnormalities, both in the research phase and when translated into real clinical use, and pass these cases on for human review. Furthermore, rather than selecting a preferred machine learning model based on metrics such as accuracy, sensitivity, or correlation, as is common in AI and NLP applications, we must seek to understand the underlying mechanisms and the context in which they will be used (Ethayarajh and Jurafsky, 2020; Hand, 2006).…”
Section: Introduction
confidence: 99%
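The flag-and-route pattern described above can be made concrete with a small sketch (entirely our illustration; the thresholds, names, and labels are hypothetical, not from the cited work):

```python
# Hypothetical confidence gate: route abnormal or low-confidence inputs
# to human review instead of acting on the model's prediction.

def route(prediction, confidence, input_ok, min_conf=0.9):
    """Return ('auto', label) only when the input looks normal and the
    model is confident; otherwise escalate to a human reviewer."""
    if not input_ok or confidence < min_conf:
        return ("human_review", prediction)
    return ("auto", prediction)

# Example: an out-of-range input value trips the abnormality check,
# so even a confident prediction is escalated.
print(route("sepsis_risk_high", confidence=0.97, input_ok=False))
# ('human_review', 'sepsis_risk_high')
print(route("sepsis_risk_low", confidence=0.95, input_ok=True))
# ('auto', 'sepsis_risk_low')
```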