Proceedings of the Second Workshop on Insights From Negative Results in NLP 2021
DOI: 10.18653/v1/2021.insights-1.18
Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics

Abstract: Much of recent progress in NLU was shown to be due to models' learning dataset-specific heuristics. We conduct a case study of generalization in NLI (from MNLI to the adversarially constructed HANS dataset) in a range of BERT-based architectures (adapters, Siamese Transformers, HEX debiasing), as well as with subsampling the data and increasing the model size. We report 2 successful and 3 unsuccessful strategies, all providing insights into how Transformer-based models learn to generalize.
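The HANS dataset referenced in the abstract is built to expose shortcut heuristics such as lexical overlap, where a model predicts "entailment" whenever every hypothesis word also appears in the premise. A minimal, purely illustrative sketch of that heuristic (not code from the paper):

```python
# Illustrative sketch of the lexical-overlap heuristic that HANS targets.
# A model relying on this shortcut predicts "entailment" whenever every
# hypothesis token also occurs in the premise.

def lexical_overlap(premise: str, hypothesis: str) -> bool:
    """Return True if all hypothesis tokens occur in the premise."""
    premise_tokens = set(premise.lower().split())
    return all(tok in premise_tokens for tok in hypothesis.lower().split())

# The heuristic fires on both pairs, but only the first is entailment:
print(lexical_overlap("the doctor paid the actor",
                      "the doctor paid the actor"))  # True
print(lexical_overlap("the doctor paid the actor",
                      "the actor paid the doctor"))  # True, yet not entailment
```

HANS contains many such pairs where the heuristic fires but the gold label is non-entailment, which is why models that learned the shortcut on MNLI fail on it.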

Cited by 39 publications (19 citation statements)
References 23 publications (25 reference statements)
“…The only difference in the procedure is that we take the representation of the [CLS] token to be the embedding of the sentence and omit the MLP. We study one architecture, BERT-Small (Bhargava et al, 2021; Turc et al, 2019), which is a BERT architecture with 4 hidden layers (Devlin et al, 2018).…”
Section: Methods
confidence: 99%
“…Methods under this category do not directly alter the training dataset, but instead resort to changes in the modeling technique: these changes can be in terms of the optimization function, regularization, additional auxiliary costs, etc. The main idea in DB is to utilize known biases (or identify unknown biases) in the data distribution, model these biases in the training pipeline, and use this knowledge to train robust classifiers (Clark et al, 2019; Bhargava et al, 2021). In the image classification literature, there is growing consensus on enforcing a consistency on different views (or augmentations) of an image in order to achieve debiasing (Hendrycks et al, 2020c; Xu et al, 2020; Chai et al, 2021; Nam et al, 2021).…”
Section: Categorization Of Domain Generalization Methods
confidence: 99%
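One common instantiation of the debiasing idea the excerpt cites is the product-of-experts ensemble of Clark et al. (2019): during training, the main model's log-probabilities are added to those of a fixed, intentionally biased model, so the main model is only pushed to explain what the bias model cannot. A toy 3-class sketch (all numbers illustrative, not from the paper):

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

main_logits = np.array([[2.0, 0.5, 0.1]])   # trainable model
bias_logits = np.array([[3.0, 0.0, 0.0]])   # frozen bias-only model

# Product of experts: sum log-probabilities, then train the main model
# with the NLL of the combined distribution (gold class 0 here).
combined = log_softmax(main_logits) + log_softmax(bias_logits)
loss = -log_softmax(combined)[0, 0]
print(loss)
```

At test time only the main model is used, so the shortcut captured by the bias model is discarded.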
“…The third category on our source of shift axis concerns the case in which one data partition (usually the training set) is a fully natural corpus, but the other partition is designed with specific properties in mind, to address a generalisation aspect of interest. Data in the constructed partition may avoid or contain specific (syntactic) patterns (Bhargava et al, 2021; Cui et al, 2022), violate heuristics about gender (Dayanik and Padó, 2021; Libovický et al, 2022), or include unusually long or complex sequences (Lakretz et al, 2021a; Raunak et al, 2019). As an example of this shift source, Dankers et al (2022) …”
Section: Generated Shifts
confidence: 99%
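The constructed-partition setup the excerpt describes can be sketched in a few lines: the training set is the natural corpus, while the evaluation partition is selected for a specific property, here unusually long sentences (the data and threshold are illustrative assumptions, not from any of the cited works):

```python
# Minimal sketch of a constructed evaluation partition: train on natural
# data, evaluate on examples selected for a targeted property
# (here, sentences longer than a token-count threshold).

corpus = [
    "the cat sat",
    "dogs bark",
    "a remarkably long sentence with many more tokens than the others appears here",
]

LONG = 8  # token-count threshold (assumption, for illustration)
test_partition = [s for s in corpus if len(s.split()) > LONG]
train_partition = [s for s in corpus if len(s.split()) <= LONG]

print(len(train_partition), len(test_partition))  # 2 1
```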