2020
DOI: 10.1609/aaai.v34i05.6399
WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Abstract: The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or…

Cited by 290 publications (364 citation statements). References 0 publications.
“…The model generates the correct stereotypes when there is high lexical overlap with the post (e.g., examples d and e). This is in line with previous research showing that large language models rely on correlational patterns in data (Sap et al, 2019c;Sakaguchi et al, 2020).…”
Section: Classification (supporting)
confidence: 93%
“…Once again, natural language processing offers an excellent example: language models are generally trained on one or more general-purpose objectives (e.g. next-word prediction), and, after (often minimal) fine-tuning, they are evaluated against composite benchmarks (e.g., Sakaguchi, Le Bras, Bhagavatula, & Choi, 2019;Wang et al, 2019). In this regard, a particularly interesting example is that of GPT-3 (T. B.…”
Section: Model Evaluation in Machine Learning (mentioning)
confidence: 99%
“…Recently, a much larger set of Winograd Schemas, referred to as the WinoGrande set, has been created and used as the basis of the current Winograd Challenge non-human champion, a specialized version of the UnifiedQA solver [Khashabi et al 2020]. This solver attains more than 90% accuracy in the Winograd Challenge, a truly impressive figure that is similar to human accuracy [Sakaguchi et al 2019]. An important feature of the current champion solvers is that they are based on language models learned from large textual datasets; that is, they estimate the probability for each possible solution of a Winograd Schema, and output the most likely solutions.…”
Section: The Winograd Challenge Instead Focuses on Pairs Such As (mentioning)
confidence: 99%
“…Indeed, we suspect that if one tries to follow the original guidelines concerning the Winograd Challenge as strictly as possible, then one will be left with Winograd Schemas that resemble the ones in WSC273. We now know that such guidelines limit too much the scope of Winograd Schemas: as demonstrated by recent results on computer solvers, Winograd Schemas that are (relatively) easy for human subjects are (relatively) easy for computers as well [Sakaguchi et al 2019]. In hindsight this is perhaps unsurprising because language must reflect established facts and rules and social conventions that must appear in large textual corpora.…”
Section: The Winograd Challenge Instead Focuses on Pairs Such As (mentioning)
confidence: 99%