Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.324

Dynabench: Rethinking Benchmarking in NLP

Abstract: We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and f…
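To make the collection loop in the abstract concrete, here is a minimal sketch of one human-and-model-in-the-loop round. This is not the Dynabench codebase: target_model, annotator, and the example format are hypothetical stand-ins for whatever task interface a deployment would use.

```python
# Minimal sketch of one human-and-model-in-the-loop collection round in the
# spirit described in the abstract. NOT the Dynabench implementation:
# target_model, annotator, and the example format are hypothetical stand-ins.

def collect_adversarial_examples(target_model, annotator, n_examples):
    """Keep examples that fool the target model but not a second person."""
    collected = []
    while len(collected) < n_examples:
        text, intended_label = annotator.write_example()
        model_label = target_model.predict(text)
        if model_label != intended_label:            # model is fooled
            verified_label = annotator.verify(text)  # another person labels it
            if verified_label == intended_label:     # human agrees with author
                collected.append((text, intended_label))
    return collected
```

The key design point is the second human pass: an example only enters the dataset when the model's error is confirmed to be a genuine mistake rather than an ambiguous or mislabeled input.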

Cited by 84 publications (81 citation statements). References 81 publications.
“…Adversarial filtering There has been a trend toward adversarial data collection paradigms, in which annotators are asked to produce examples on which current systems fail (Le Bras et al., 2020; Kiela et al., 2021; Talmor et al., 2021; Jia and Liang, 2017; Zellers et al., 2019, i.a.). While this eliminates examples containing artifacts that models have already learned, it does not prevent the creation of new ones on which current models fail.…”
Section: Related Work
confidence: 99%
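The excerpt above contrasts collecting new hard examples with filtering out easy ones. As a rough illustration of the filtering side of that idea, one can iteratively drop examples that a weak proxy model already solves with high confidence. This is a sketch only, not the AFLite procedure of Le Bras et al. (2020); train_proxy and prob_correct are assumed, hypothetical interfaces.

```python
# Rough sketch of adversarial filtering: drop examples a weak proxy model
# already solves confidently, so the retained set is harder for current
# systems. Illustration only, not the AFLite algorithm; train_proxy and
# prob_correct are hypothetical interfaces.

def adversarial_filter(examples, train_proxy, rounds=3, keep_below=0.75):
    """Iteratively remove (text, label) pairs a proxy model gets right easily."""
    kept = list(examples)
    for _ in range(rounds):
        proxy = train_proxy(kept)  # e.g. a simple linear classifier
        kept = [(x, y) for x, y in kept
                if proxy.prob_correct(x, y) < keep_below]
    return kept
```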
“…We convert natural language explanations to oracle importance scores instead of collecting oracle importance scores directly from naïve annotators for two reasons. First, there already exist data sets of natural language explanations, where annotators were required to reason about models' decision making in an adversarial setting (Nie et al., 2020), and more such data sets are being generated (Kiela et al., 2021). Second, we contend that for most nonexpert annotators, asking them to provide verbal descriptions is easier and more natural than asking them to answer a question like, "For which words do you think the model's prediction would change the most if that word was blanked out?".…”
Section: Computing Oracle Importance Scores
confidence: 99%
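The word-blanking question quoted above corresponds to a simple leave-one-out occlusion score: each word is scored by how much the model's probability for the original label drops when that word is masked. The sketch below shows that intuition only; it is not the cited method for converting natural language explanations into oracle scores, and model.prob and the mask token are hypothetical.

```python
# Leave-one-out occlusion sketch of the "what if this word were blanked out?"
# question. Hypothetical interface: model.prob(text, label) returns the
# model's probability for `label` given `text`.

def blank_out_importance(model, words, label, mask="[BLANK]"):
    """Return one score per word; higher = prediction changes more when blanked."""
    base = model.prob(" ".join(words), label)
    scores = []
    for i in range(len(words)):
        blanked = words[:i] + [mask] + words[i + 1:]
        scores.append(base - model.prob(" ".join(blanked), label))
    return scores
```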
“…Previous work has shed light on several dimensions of hate speech data that prevent generalisation, such as imprecise construct specification (Samory et al., 2021), biased data collection (Ousidhoum et al., 2020), and annotation artifacts (Waseem, 2016). Several solutions have been proposed for these issues, such as adversarial data generation (Dinan et al., 2019), dynamic benchmarking (Kiela et al., 2021), and debiasing techniques.…”
Section: Related Work
confidence: 99%