A Marker Passing Approach to Winograd Schemas

Fähndrich, Johannes; Weber, Sabine; Kanthak, Hannes

doi:10.1007/978-3-030-04284-4_12

Cited by 5 publications

(5 citation statements)

References 22 publications


“…Since WSC was proposed as a benchmark for commonsense (Levesque et al, 2012), there were many attempts to improve performance on this benchmark, that involved different approaches including web queries (Rahman and Ng, 2012; Sharma et al, 2015; Emami et al, 2018), using external knowledge sources (Sharma, 2019), information extraction and reasoning (Isaak and Michael, 2016) and more (Peng et al, 2015; Liu et al, 2017a,b; Fähndrich et al, 2018; Klein and Nabi, 2019; Zhang et al, 2019, 2020a). Newer approaches use LMs to assign a probability to a sentence by replacing the pronoun with an entity, one at a time, and pick the more probable sentence (Trinh and Le, 2018; Opitz and Frank, 2018; Radford et al, 2019; Kocijan et al, 2019).…”

Section: Progress On WSC (mentioning)

confidence: 99%

Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema

Elazar, Zhang, Goldberg et al. 2021

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing


The Winograd Schema (WS) has been proposed as a test for measuring commonsense capabilities of models. Recently, pre-trained language model-based approaches have boosted performance on some WS benchmarks but the source of improvement is still not clear. This paper suggests that the apparent progress on WS may not necessarily reflect progress in commonsense reasoning. To support this claim, we first show that the current evaluation method of WS is sub-optimal and propose a modification that uses twin sentences for evaluation. We also propose two new baselines that indicate the existence of artifacts in WS benchmarks. We then develop a method for evaluating WS-like sentences in a zero-shot setting to account for the commonsense reasoning abilities acquired during the pretraining and observe that popular language models perform randomly in this setting when using our more strict evaluation. We conclude that the observed progress is mostly due to the use of supervision in training WS models, which is not likely to successfully support all the required commonsense reasoning skills and knowledge.


“…The third step is where approaches differ the most one from another. Approaches used were, for example, SVM-rankers (Rahman and Ng, 2012), integer linear programming (Peng et al, 2015), answer set programming (Sharma et al, 2015), message passing on a graph (Fähndrich et al, 2018), and formal logic (Isaak and Michael, 2016).…”

Section: Feature-based Approaches To the Winograd Schema Challenge (mentioning)

confidence: 99%

The Defeat of the Winograd Schema Challenge

Kocijan, Davis, Lukasiewicz et al. 2022

Preprint


The Winograd Schema Challenge, a set of twin sentences involving pronoun reference disambiguation that seem to require the use of commonsense knowledge, was proposed by Hector Levesque in 2011. By 2019, a number of AI systems, based on large pre-trained transformer-based language models and fine-tuned on these kinds of problems, achieved better than 90% accuracy. In this paper, we review the history of the Winograd Schema Challenge and assess its significance.


“…Since WSC was proposed as a benchmark for commonsense (Levesque et al, 2012), there were many attempts to improve performance on this benchmark, that involved different approaches from web queries (Rahman and Ng, 2012; Sharma et al, 2015; Emami et al, 2018), using external knowledge sources (Sharma, 2019), information extraction and reasoning (Isaak and Michael, 2016) and more (Peng et al, 2015; Liu et al, 2017a,b; Fähndrich et al, 2018; Klein and Nabi, 2019; Zhang et al, 2019, 2020a). Newer approaches use LMs to assign a probability to a sentence by replacing the pronoun with an entity, one at a time, and pick the more probable sentence (Trinh and Le, 2018; Opitz and Frank, 2018; Radford et al, 2019; Kocijan et al, 2019).…”

Section: Progress On WSC (mentioning)

confidence: 99%

Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema

Elazar, Zhang, Goldberg et al. 2021

Preprint


The Winograd Schema (WS) has been proposed as a test for measuring commonsense capabilities of models. Recently, pre-trained language model-based approaches have boosted performance on some WS benchmarks but the source of improvement is still not clear. We begin by showing that the current evaluation method of WS is sub-optimal and propose a modification that makes use of twin sentences for evaluation. We also propose two new baselines that indicate the existence of biases in WS benchmarks. Finally, we propose a method for evaluating WS-like sentences in a zero-shot setting and observe that popular language models perform randomly in this setting. We conclude that much of the apparent progress on WS may not necessarily reflect progress in commonsense reasoning, but much of it comes from supervised data, which is not likely to account for all the required commonsense reasoning skills and knowledge.

