2021
DOI: 10.48550/arxiv.2104.14337
Preprint

Dynabench: Rethinking Benchmarking in NLP

Abstract: We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and f…
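The acceptance criterion described in the abstract can be summarized in a short sketch. The snippet below is illustrative only and does not reflect Dynabench's actual codebase or API; target_model.predict and human_validator.label are hypothetical interfaces standing in for the deployed target model and the human verification step.

```python
# Illustrative sketch of the acceptance rule described in the abstract
# (hypothetical interfaces; not the Dynabench implementation):
# an annotator's example is collected only if the target model misclassifies it
# while a human validator still assigns the annotator's intended label.

def accept_example(text, intended_label, target_model, human_validator):
    """Keep examples that fool the model but not another person."""
    model_prediction = target_model.predict(text)   # hypothetical classifier interface
    human_label = human_validator.label(text)       # hypothetical validation interface
    return model_prediction != intended_label and human_label == intended_label
```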

Cited by 12 publications (23 citation statements)
References: 52 publications
“…In recent years, this approach has also been proposed as a method of evaluating language model classifiers in general. Several recent datasets and benchmarks are constructed with human-in-the-loop adversaries, such as AdversarialNLI [36], AdversarialGLUE [37], and DynaBench [38]. Our analysis of the effects of multiple iterations of adversarial training resembles DADC [39].…”
Section: Adversarial Training For Language Models (mentioning)
confidence: 99%
“…Along with the preponderance of high-quality text data and the simplicity of scaling language models, these benchmarks have helped steer the field toward rapid progress (Brown et al., 2020; Rae et al., 2021). Recently, with the arrival of highly capable language models, human evaluation has become a crucial tool, allowing the dynamic evaluation of models as they improve (Kiela et al., 2021; Thoppilan et al., 2022). These methods are complementary to more static benchmarks like SuperGLUE (Wang et al., 2019).…”
Section: Related Work (mentioning)
confidence: 99%
“…Soon after Transformers took over the field, adversarial tests resulted in significantly lower performance figures, which increased the importance of adversarial attacks [16]. General shortcomings of language models and their benchmarks led to new approaches such as Dynabench [17]. Adversarial GLUE (AdvGLUE) [18] focuses on the added difficulty of maintaining the semantic meaning when applying a general attack framework for generating adversarial texts.…”
Section: Related Work (mentioning)
confidence: 99%