Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.659
The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions

Abstract: We find that the performance of state-of-the-art models on Natural Language Inference (NLI) and Reading Comprehension (RC) analysis/stress sets can be highly unstable. This raises three questions: (1) How will the instability affect the reliability of the conclusions drawn based on these analysis sets? (2) Where does this instability come from? (3) How should we handle this instability and what are some potential solutions? For the first question, we conduct a thorough empirical study over analysis sets and find…

Cited by 21 publications (15 citation statements)
References 34 publications
“…OOD accuracy is highly variable across the spectrum of ID accuracies, and there is no precise linear trend. [118]. We explore this hypothesis in a synthetic CIFAR-10 setting, where we simulate increasing the similarity between examples by taking a small seed set of examples and then using data augmentations to create multiple similar versions.…”
Section: Camelyon17-WILDS (mentioning)
confidence: 99%
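A minimal sketch of this kind of similarity simulation, assuming torchvision's CIFAR-10 loader and standard augmentations; the seed-set size, number of copies, and specific transforms are illustrative choices, not taken from the cited work:

import torch
from torchvision import datasets, transforms

# Illustrative augmentation pipeline; the cited work's exact transforms are not specified here.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

cifar = datasets.CIFAR10(root="./data", train=True, download=True)

# Take a small seed set and expand each example into many near-duplicate augmented versions.
seed_size, copies_per_example = 500, 20
seed_indices = torch.randperm(len(cifar))[:seed_size]

augmented_images, labels = [], []
for idx in seed_indices.tolist():
    img, label = cifar[idx]  # PIL image and integer class label
    for _ in range(copies_per_example):
        augmented_images.append(augment(img))
        labels.append(label)

synthetic_x = torch.stack(augmented_images)  # shape: (seed_size * copies_per_example, 3, 32, 32)
synthetic_y = torch.tensor(labels)

Training on synthetic_x and synthetic_y then mimics a dataset whose examples become increasingly similar to one another as copies_per_example grows.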
“…Qualitatively, these bounds suggest that out-of-distribution accuracy may vary widely as a function of in-distribution accuracy unless the distribution distance d is small and the accuracies are therefore close (see Figure 1 (top-left) for an illustration). More recently, empirical studies have shown that in some settings, models with similar in-distribution performance can indeed have different out-of-distribution performance [29,71,118].…”
Section: Introduction (mentioning)
confidence: 99%
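For context, one classical inequality of this shape, stated here under the assumption that d is the total variation distance (the excerpt does not specify which distance or bound the authors analyze), holds for any classifier h:

\[ \bigl|\,\mathrm{acc}_{P_{\mathrm{OOD}}}(h) - \mathrm{acc}_{P_{\mathrm{ID}}}(h)\,\bigr| \;\le\; d_{\mathrm{TV}}\bigl(P_{\mathrm{ID}},\, P_{\mathrm{OOD}}\bigr) \]

When the distance is small the two accuracies are forced to be close; when it is large, the bound allows them to diverge widely, matching the qualitative picture described above.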
“…In particular, we examine the individual losses on each training batch and measure their variability using percentiles (i.e., the 0th, 25th, 50th, 75th, and 100th percentiles). Figure 5 shows the comparison of the individual loss variability… Bias identification stability: Researchers have recently observed large variability in the generalization performance of fine-tuned BERT models (Mosbach et al., 2020; Zhang et al., 2020), especially in out-of-distribution evaluation settings (McCoy et al., 2019a; Zhou et al., 2020). This may raise concerns about whether our shallow models, which are trained on a sub-sample of the training data, can consistently learn to rely mostly on biases.…”
Section: Impact on Learning Dynamics (mentioning)
confidence: 99%
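A minimal sketch of this per-batch loss-variability measurement, assuming a PyTorch classifier trained with cross-entropy; the helper name and the eval/no-grad handling are illustrative, not taken from the cited work:

import torch
import torch.nn.functional as F

def batch_loss_percentiles(model, inputs, targets):
    """Per-example losses on one training batch, summarized by the
    0th/25th/50th/75th/100th percentiles mentioned in the excerpt."""
    model.eval()
    with torch.no_grad():
        logits = model(inputs)
        # reduction="none" keeps one loss value per example instead of the batch mean.
        per_example_loss = F.cross_entropy(logits, targets, reduction="none")
    quantile_points = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
    return torch.quantile(per_example_loss, quantile_points)

Calling this on every training batch and plotting the five values per batch reproduces the kind of variability comparison the excerpt attributes to its Figure 5.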
“…Existing NLP model analysis tools are often tailored to specific tasks or models (e.g., Wang et al., 2019; Zhou et al., 2020). In the remainder of this section, we give examples of model-agnostic tools, as they are more closely related to our tool.…”
Section: Tools for Analyzing NLP Models (mentioning)
confidence: 99%
“…In natural language processing (NLP), the standard approach to tuning and selecting machine learning models is to use a held-out development set. However, recent work has pointed out that evaluation scores on a development set are often not indicative of model performance on an unseen test set (Reimers and Gurevych, 2018; Zhou et al., 2020). In addition, how to choose a good development set remains an open research question.…”
Section: Introduction (mentioning)
confidence: 99%