Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.659
The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions

Abstract: We find that the performance of state-of-the-art models on Natural Language Inference (NLI) and Reading Comprehension (RC) analysis/stress sets can be highly unstable. This raises three questions: (1) How will the instability affect the reliability of the conclusions drawn based on these analysis sets? (2) Where does this instability come from? (3) How should we handle this instability and what are some potential solutions? For the first question, we conduct a thorough empirical study over analysis sets and find…

Cited by 21 publications (15 citation statements)
References 34 publications
“…OOD accuracy is highly variable across the spectrum of ID accuracies, and there is no precise linear trend. [118]. We explore this hypothesis in a synthetic CIFAR-10 setting, where we simulate increasing the similarity between examples by taking a small seed set of examples and then using data augmentations to create multiple similar versions.…”
Section: Camelyon17-WILDS (mentioning)
confidence: 99%
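A minimal sketch of this kind of similarity simulation, assuming torchvision's CIFAR-10 loader and standard augmentations; the seed-set size, number of copies, and specific transforms are illustrative choices, not taken from the cited work:

import torch
from torchvision import datasets, transforms

# Illustrative augmentation pipeline; the cited work's exact transforms are not specified here.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

cifar = datasets.CIFAR10(root="./data", train=True, download=True)

# Take a small seed set and expand each example into many near-duplicate augmented versions.
seed_size, copies_per_example = 500, 20
seed_indices = torch.randperm(len(cifar))[:seed_size]

augmented_images, labels = [], []
for idx in seed_indices.tolist():
    img, label = cifar[idx]  # PIL image and integer class label
    for _ in range(copies_per_example):
        augmented_images.append(augment(img))
        labels.append(label)

synthetic_x = torch.stack(augmented_images)  # shape: (seed_size * copies_per_example, 3, 32, 32)
synthetic_y = torch.tensor(labels)

Training on synthetic_x and synthetic_y then mimics a dataset whose examples become increasingly similar to one another as copies_per_example grows.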
“…Qualitatively, these bounds suggest that out-of-distribution accuracy may vary widely as a function of in-distribution accuracy unless the distribution distance d is small and the accuracies are therefore close (see Figure 1 (top-left) for an illustration). More recently, empirical studies have shown that in some settings, models with similar in-distribution performance can indeed have different out-of-distribution performance [29,71,118].…”
Section: Introduction (mentioning)
confidence: 99%
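For context, one classical inequality of this shape, stated here under the assumption that d is the total variation distance (the excerpt does not specify which distance or bound the authors analyze), holds for any classifier h:

\[ \bigl|\,\mathrm{acc}_{P_{\mathrm{OOD}}}(h) - \mathrm{acc}_{P_{\mathrm{ID}}}(h)\,\bigr| \;\le\; d_{\mathrm{TV}}\bigl(P_{\mathrm{ID}},\, P_{\mathrm{OOD}}\bigr) \]

When the distance is small the two accuracies are forced to be close; when it is large, the bound allows them to diverge widely, matching the qualitative picture described above.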
“…In particular, we examine the individual losses on each training batch and measure their variability using percentiles (i.e., the 0th, 25th, 50th, 75th, and 100th percentiles). Figure 5 shows the comparison of the individual loss variability… Bias identification stability: Researchers have recently observed large variability in the generalization performance of fine-tuned BERT models (Mosbach et al., 2020; Zhang et al., 2020), especially in out-of-distribution evaluation settings (McCoy et al., 2019a; Zhou et al., 2020). This may raise concerns about whether our shallow models, which are trained on a sub-sample of the training data, can consistently learn to rely mostly on biases.…”
Section: Impact on Learning Dynamics (mentioning)
confidence: 99%
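A minimal sketch of this per-batch loss-variability measurement, assuming a PyTorch classifier trained with cross-entropy; the helper name and the eval/no-grad handling are illustrative, not taken from the cited work:

import torch
import torch.nn.functional as F

def batch_loss_percentiles(model, inputs, targets):
    """Per-example losses on one training batch, summarized by the
    0th/25th/50th/75th/100th percentiles mentioned in the excerpt."""
    model.eval()
    with torch.no_grad():
        logits = model(inputs)
        # reduction="none" keeps one loss value per example instead of the batch mean.
        per_example_loss = F.cross_entropy(logits, targets, reduction="none")
    quantile_points = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
    return torch.quantile(per_example_loss, quantile_points)

Calling this on every training batch and plotting the five values per batch reproduces the kind of variability comparison the excerpt attributes to its Figure 5.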
“…Existing NLP model analysis tools are often tailored to specific tasks or models (e.g., Wang et al., 2019; Zhou et al., 2020). In the remainder of this section, we give examples of model-agnostic tools, as they are more closely related to our tool.…”
Section: Tools for Analyzing NLP Models (mentioning)
confidence: 99%
“…In natural language processing (NLP), the standard approach to tuning and selecting machine learning models is to use a held-out development set. However, recent work has pointed out that evaluation scores on a development set are often not indicative of model performance on an unseen test set (Reimers and Gurevych, 2018; Zhou et al., 2020). In addition, how to choose a good development set remains an open research question.…”
Section: Introduction (mentioning)
confidence: 99%