2020
DOI: 10.1609/aaai.v34i05.6399
WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Abstract: The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or…

Cited by 290 publications (364 citation statements). References 0 publications.
“…The model generates the correct stereotypes when there is high lexical overlap with the post (e.g., examples d and e). This is in line with previous research showing that large language models rely on correlational patterns in data (Sap et al, 2019c;Sakaguchi et al, 2020).…”
Section: Classification (supporting)
confidence: 93%
“…Once again, natural language processing offers an excellent example: language models are generally trained on one or more general-purpose objectives (e.g. next-word prediction), and, after (often minimal) fine-tuning, they are evaluated against composite benchmarks (e.g., Sakaguchi, Le Bras, Bhagavatula, & Choi, 2019;Wang et al, 2019). In this regard, a particularly interesting example is that of GPT-3 (T. B.…”
Section: Model Evaluation in Machine Learning (mentioning)
confidence: 99%
“…Recently, a much larger set of Winograd Schemas, referred to as the WinoGrande set, has been created and used as the basis of the current Winograd Challenge non-human champion, a specialized version of the UnifiedQA solver [Khashabi et al 2020]. This solver attains more than 90% accuracy in the Winograd Challenge, a truly impressive figure that is similar to human accuracy [Sakaguchi et al 2019]. An important feature of the current champion solvers is that they are based on language models learned from large textual datasets; that is, they estimate the probability for each possible solution of a Winograd Schema, and output the most likely solutions.…”
Section: The Winograd Challenge Instead Focuses on Pairs Such As (mentioning)
confidence: 99%
“…Indeed, we suspect that if one tries to follow the original guidelines concerning the Winograd Challenge as strictly as possible, then one will be left with Winograd Schemas that resemble the ones in WSC273. We now know that such guidelines limit too much the scope of Winograd Schemas: as demonstrated by recent results on computer solvers, Winograd Schemas that are (relatively) easy for human subjects are (relatively) easy for computers as well [Sakaguchi et al 2019]. In hindsight this is perhaps unsurprising because language must reflect established facts and rules and social conventions that must appear in large textual corpora.…”
Section: The Winograd Challenge Instead Focuses on Pairs Such As (mentioning)
confidence: 99%