Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.679
The Sensitivity of Language Models and Humans to Winograd Schema Perturbations

Abstract: Large-scale pretrained language models are the major driving force behind recent improvements in performance on the Winograd Schema Challenge, a widely employed test of commonsense reasoning ability. We show, however, with a new diagnostic dataset, that these models are sensitive to linguistic perturbations of the Winograd examples that minimally affect human understanding. Our results highlight interesting differences between humans and language models: language models are more sensitive to number or gender a…

Cited by 24 publications (31 citation statements) | References 36 publications
“…They showed that the success of a then-state-of-the-art LM ensemble (Trinh and Le, 2018) resulted mainly from improvements on simpler "associative" instances. Similarly, experiments by Abdou et al. (2020) show that models are sensitive to linguistic perturbations of Winograd-style examples. New datasets have been proposed to circumvent issues of unintentionally easy test instances, including Winogrande (Sakaguchi et al., 2020), a scaled WSC variant debiased against RoBERTa, and KnowRef, which consists of naturally occurring sentences that are free of WSC-specific stylistic quirks.…”
Section: Related Work
Mentioning confidence: 91%
“…Probability-A (Pro-A) considers the generative probability of the choice conditioned on the question. However, it suffers from statistical biases of the choices, such as word frequency and sentence length (Abdou et al., 2020). To alleviate this, MutualInfo-QA (MI-QA) calculates the mutual information between the question and the choice.…”
Section: Related Work
Mentioning confidence: 99%
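To make the difference between these two scores concrete, here is a minimal sketch, assuming a hypothetical `lm_log_prob` helper (not part of either cited paper's code) that returns an autoregressive LM's log-probability of a string given an optional context. Pro-A ranks choices by log P(choice | question); MI-QA subtracts the unconditional log P(choice), which cancels biases that depend on the choice alone:

```python
def lm_log_prob(text: str, context: str = "") -> float:
    """Placeholder: sum of token log-probabilities of `text` under the
    LM, conditioned on `context`. Implement with any autoregressive LM."""
    raise NotImplementedError

def score_pro_a(question: str, choice: str) -> float:
    # Probability-A: log P(choice | question). Sensitive to
    # choice-level statistics such as word frequency and length.
    return lm_log_prob(choice, context=question)

def score_mi_qa(question: str, choice: str) -> float:
    # MutualInfo-QA: pointwise mutual information between question and
    # choice, i.e. log P(choice | question) - log P(choice). Subtracting
    # the unconditional prior corrects for choice-only biases.
    return lm_log_prob(choice, context=question) - lm_log_prob(choice)
```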
“…Table 1 lists several typical score functions. However, these scores can be easily influenced by word frequencies, sentence structures, and other factors, which can mislead the models and make existing methods oversensitive to lexical perturbations (Abdou et al., 2020; Tamborrino et al., 2020). Figure 1 shows two examples.…”
Section: Introduction
Mentioning confidence: 99%
“…In the examples shown throughout the paper, John and Mary are used as placeholders for the subject and object of the verb of interest. Because the choice of names to go in these slots can affect model predictions (Abdou et al., 2020), we generate 200 variants of each stimulus, varying the names and the order of the two genders, and we query the PLMs with all of them. The full procedure is described in Appendix B.…”
Section: Context-Free IC Bias
Mentioning confidence: 99%
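A minimal sketch of this kind of stimulus expansion follows; the name pools and the `{A}`/`{B}` template slots are illustrative assumptions, not the authors' exact procedure:

```python
import random

# Hypothetical name pools; the paper's actual name lists may differ.
MALE_NAMES = ["John", "James", "Robert", "Michael"]
FEMALE_NAMES = ["Mary", "Linda", "Susan", "Karen"]

def make_variants(template: str, n: int = 200, seed: int = 0) -> list[str]:
    """Fill the {A}/{B} slots of `template` with sampled name pairs,
    varying both the names and which gender appears first."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        male = rng.choice(MALE_NAMES)
        female = rng.choice(FEMALE_NAMES)
        # Randomly swap the order of the two genders.
        a, b = (male, female) if rng.random() < 0.5 else (female, male)
        variants.append(template.format(A=a, B=b))
    return variants

# Example: stimuli = make_variants("{A} praised {B} because ...")
```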
“…In this section we address a seemingly minor but important issue. Abdou et al. (2020) showed that model predictions on the Winograd Schema Challenge vary greatly with changes in the gender and identity of the proper nouns used in the stimuli. We alleviate this issue by marginalizing over a range of proper nouns.…”
Section: B Proper Nouns
Mentioning confidence: 99%
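One way to realize this marginalization, reusing the hypothetical `make_variants` helper from the sketch above together with a placeholder `model_score` function, is to average the model's score over all name variants:

```python
from statistics import mean

def model_score(stimulus: str) -> float:
    """Placeholder: any per-stimulus model score, e.g. a log-probability."""
    raise NotImplementedError

def marginalized_score(template: str) -> float:
    # Average over the name variants so that no single choice of
    # proper nouns drives the prediction.
    return mean(model_score(s) for s in make_variants(template))
```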