Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems 2021
DOI: 10.18653/v1/2021.eval4nlp-1.3
How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task

Abstract: Despite their success, modern language models are fragile. Even small changes in their training pipeline can lead to unexpected results. We study this phenomenon by examining the robustness of ALBERT (Lan et al., 2020) in combination with Stochastic Weight Averaging (SWA), a cheap way of ensembling, on a sentiment analysis task (SST-2). In particular, we analyze SWA's stability via CheckList criteria (Ribeiro et al., 2020), examining the agreement on errors made by models differing only in their random seed. W…
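As a rough illustration of the SWA idea the abstract refers to, the sketch below averages the weights visited in the tail of training into a single model using PyTorch's swa_utils. This is a minimal, self-contained toy example: the tiny classifier and random data are illustrative stand-ins, not the paper's ALBERT/SST-2 setup.

```python
# Minimal sketch of Stochastic Weight Averaging (SWA) as a cheap ensemble:
# weights from the tail of training are folded into one averaged model.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

torch.manual_seed(0)
X = torch.randn(256, 16)                       # toy features
y = (X.sum(dim=1) > 0).long()                  # toy binary labels
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

swa_model = AveragedModel(model)               # holds the running weight average
swa_scheduler = SWALR(optimizer, swa_lr=5e-4)  # constant LR during averaging
swa_start = 3                                  # epoch at which averaging begins

for epoch in range(10):
    for xb, yb in loader:
        optimizer.zero_grad()
        criterion(model(xb), yb).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # fold current weights into average
        swa_scheduler.step()

# Refresh BatchNorm statistics for the averaged weights (a no-op here,
# since this toy model has no BatchNorm layers).
update_bn(loader, swa_model)
```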

Cited by 5 publications (2 citation statements) · References 16 publications (12 reference statements)
“…Fleiss' Kappa: Similar to Khurana et al. (2021), we adopt Fleiss' Kappa, which is a popular measure of inter-rater consistency (Fleiss, 1971), to measure the consistency among different models' predictions. Because Fleiss' Kappa is negatively correlated with models' instability and ranges from 0 to 1, we use its difference from one as the output, to stay consistent with the other measures.…”
Section: Prediction Measures · Citation type: mentioning · Confidence: 99%
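The "one minus Fleiss' Kappa" instability score described in this statement can be written in a few lines. The sketch below uses statsmodels' fleiss_kappa; the small prediction matrix is fabricated purely for illustration, with each column standing in for a model trained under a different random seed.

```python
# Instability as (1 - Fleiss' kappa) over seed-varied model predictions.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# preds[i, j] = predicted label of model j (seed j) on example i
preds = np.array([
    [1, 1, 1, 1, 1],   # all five seeds agree
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],   # seeds disagree
    [0, 0, 1, 0, 0],
])

# aggregate_raters turns raw labels into per-example category counts
counts, _ = aggregate_raters(preds)
kappa = fleiss_kappa(counts, method='fleiss')
instability = 1.0 - kappa   # higher = less agreement across seeds
print(f"kappa={kappa:.3f}, instability={instability:.3f}")
```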
“…To better assess the instability of fine-tuning pretrained language models (PLMs), we study more measures concerning instability at different granularity levels (Summers and Dinneen, 2021; Khurana et al., 2021; Raghu et al., 2017; Kornblith et al., 2019; Ding et al., 2021) and develop a framework to assess their validity. We focus on BERT and RoBERTa for their popularity, but our framework can also be applied to other PLMs.…”
Section: Introduction · Citation type: mentioning · Confidence: 99%
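Among the representation-level measures this statement alludes to (Raghu et al., 2017; Kornblith et al., 2019), linear Centered Kernel Alignment (CKA) has a particularly compact form. The sketch below is a generic implementation under that reading, not code from the cited paper; the random matrices stand in for per-layer activations of two differently seeded fine-tuning runs.

```python
# Linear CKA (Kornblith et al., 2019): similarity between two activation
# matrices, here used to compare representations across training seeds.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_examples, dim)."""
    X = X - X.mean(axis=0)          # center each feature dimension
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2       # ||Y^T X||_F^2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
acts_run1 = rng.normal(size=(100, 64))                     # seed-1 activations (toy)
acts_run2 = acts_run1 + 0.1 * rng.normal(size=(100, 64))   # slightly perturbed run
print(f"CKA = {linear_cka(acts_run1, acts_run2):.3f}")     # near 1 = very similar
```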