2021
DOI: 10.48550/arxiv.2104.14337
Preprint

Dynabench: Rethinking Benchmarking in NLP

Abstract: We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and f…
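The acceptance criterion described in the abstract can be summarized in a short sketch. The snippet below is illustrative only and does not reflect Dynabench's actual codebase or API; target_model.predict and human_validator.label are hypothetical interfaces standing in for the deployed target model and the human verification step.

```python
# Illustrative sketch of the acceptance rule described in the abstract
# (hypothetical interfaces; not the Dynabench implementation):
# an annotator's example is collected only if the target model misclassifies it
# while a human validator still assigns the annotator's intended label.

def accept_example(text, intended_label, target_model, human_validator):
    """Keep examples that fool the model but not another person."""
    model_prediction = target_model.predict(text)   # hypothetical classifier interface
    human_label = human_validator.label(text)       # hypothetical validation interface
    return model_prediction != intended_label and human_label == intended_label
```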

Cited by 12 publications (23 citation statements)
References: 52 publications
“…In recent years, this approach has also been proposed as a method of evaluating language model classifiers in general. Several recent datasets and benchmarks are constructed with human-in-the-loop adversaries, such as AdversarialNLI [36], AdversarialGLUE [37], and DynaBench [38]. Our analysis of the effects of multiple iterations of adversarial training resembles DADC [39].…”
Section: Adversarial Training For Language Models (mentioning)
confidence: 99%
“…Along with the preponderance of high-quality text data and the simplicity of scaling language models, these benchmarks have helped steer the field toward rapid progress (Brown et al., 2020; Rae et al., 2021). Recently, with the arrival of highly capable language models, human evaluation has become a crucial tool, allowing the dynamic evaluation of models as they improve (Kiela et al., 2021; Thoppilan et al., 2022). These methods are complementary to more static benchmarks like SuperGLUE (Wang et al., 2019).…”
Section: Related Work (mentioning)
confidence: 99%
“…Soon after Transformers took over the field, adversarial tests resulted in significantly lower performance figures, which increased the importance of adversarial attacks [16]. General shortcomings of language models and their benchmarks led to new approaches such as Dynabench [17]. Adversarial GLUE (AdvGLUE) [18] focuses on the added difficulty of maintaining the semantic meaning when applying a general attack framework for generating adversarial texts.…”
Section: Related Work (mentioning)
confidence: 99%