Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) 2019
DOI: 10.18653/v1/k19-1087

On Model Stability as a Function of Random Seed

Abstract: In this paper, we focus on quantifying model stability as a function of random seed by investigating the effects of the induced randomness on model performance and the robustness of the model in general. We specifically perform a controlled study on the effect of random seeds on the behaviour of attention, gradient-based and surrogate-model-based (LIME) interpretations. Our analysis suggests that random seeds can adversely affect the consistency of models, resulting in counterfactual interpretations. We propose …
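As a concrete illustration of the kind of experiment the abstract describes, below is a minimal Python sketch, assuming a PyTorch setup, of measuring how test accuracy varies across random seeds. The function train_and_evaluate and its placeholder body are hypothetical stand-ins for a fixed architecture, dataset, and training loop; this is not the authors' released code.

    # Minimal sketch (assumed setup, not the paper's code): quantify the
    # accuracy spread across random seeds for one fixed architecture.
    import random

    import numpy as np
    import torch


    def train_and_evaluate(seed: int) -> float:
        """Train one model with all randomness pinned to `seed` and
        return its test accuracy (placeholder logic)."""
        # Pin every source of randomness to the given seed.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        # ... build the model, train it, evaluate on a held-out set ...
        accuracy = 0.0  # replace with a real evaluation
        return accuracy


    seeds = [1, 7, 42, 123, 2019]
    scores = [train_and_evaluate(s) for s in seeds]
    print(f"mean={np.mean(scores):.4f}  std={np.std(scores):.4f}  "
          f"spread={max(scores) - min(scores):.4f}")

The spread (max minus min) across seeds is one simple stability measure; the paper's analysis goes further, examining how seeds affect the consistency of attention, gradient-based, and LIME interpretations.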

Cited by 39 publications (24 citation statements). References 23 publications.

Citation statements (ordered by relevance):
“…Several works have noted that the same architecture can have very different in-distribution generalization across restarts of the same training process (Reimers and Gurevych, 2017, 2018; Madhyastha and Jain, 2019). Most relevantly for our work, fine-tuning of BERT is unstable for some datasets, such that some runs achieve state-of-the-art results while others perform poorly (Devlin et al., 2019; Phang et al., 2018).…”
Section: In-distribution Generalization
confidence: 75%
“…While this effect is common to all supervised machine learning models, it gets amplified in our case due to the large imbalance and low abundance of annotations for training. With periodic checks during training, a stable model state can be achieved, but further work may attempt to improve model stability by, for example, adding regularizers or incorporating more advanced weighting schemes (Madhyastha & Jain, 2019).…”
Section: Limitations and Future Perspectives
confidence: 99%
“…Second, using a categorical feature to denote model types constrains its expressive power for modeling performance. In reality, a slight change in model hyperparameters (Hoos and Leyton-Brown, 2014; Probst et al., 2019), optimization algorithms (Kingma and Ba, 2014), or even random seeds (Madhyastha and Jain, 2019) may give rise to a significant variation in performance, which our predictor is not able to capture. While investigating the systematic implications of model structures or hyperparameters is practically infeasible in this study, we may use additional information such as textual model descriptions to model NLP models and training procedures more elaborately in the future.…”
Section: Discussion
confidence: 99%