Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.515

An Analysis of Dataset Overlap on Winograd-Style Tasks

Abstract: The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlap between these training corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the corpora on which state-of…
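The abstract describes measuring how much WSC-style test instances overlap with pretraining corpora. Below is a minimal sketch, not the authors' actual pipeline, of one common way to quantify such overlap: the fraction of a test sentence's word n-grams that also occur in the corpus. The n-gram size, the toy corpus lines, and the scoring function are illustrative assumptions.

```python
# Hypothetical n-gram overlap check between a pretraining corpus and a
# Winograd-style test sentence; parameters and data are illustrative only.

def ngrams(text, n=8):
    """Return the set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_sentence, corpus_ngrams, n=8):
    """Fraction of the test sentence's n-grams that also appear in the corpus."""
    test_ngrams = ngrams(test_sentence, n)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & corpus_ngrams) / len(test_ngrams)

# Toy corpus: two lines standing in for a massive pretraining corpus.
corpus_lines = [
    "the city councilmen refused the demonstrators a permit because they feared violence",
    "the trophy would not fit in the brown suitcase because it was too big",
]
corpus_ngrams = set()
for line in corpus_lines:
    corpus_ngrams |= ngrams(line, n=8)

test = "The trophy doesn't fit in the brown suitcase because it is too big."
print(f"overlap: {overlap_fraction(test, corpus_ngrams, n=8):.2f}")
```

Under this kind of metric, a "large overlap" for a test instance would mean that most of its n-grams were already seen verbatim during pretraining, which is the situation the paper flags as inflating benchmark performance.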

Cited by 10 publications (7 citation statements). References 20 publications.
“…Knowledge leakage in the evaluation. One recent finding in the field is about knowledge leakage between train and evaluation sets (Lewis et al., 2020b; Emami et al., 2020). Similar concerns have motivated our careful train/evaluation splits (§4) and experiments with varying training set sizes.…”
Section: Discussion (mentioning)
confidence: 79%
“…So far the fundamental paradigm for NLP work based on machine learning has focused on in-distribution evaluation: the test sample would come from the same distribution as the train/validation samples, and the samples would be randomly split. Within that paradigm, it is essential that there are no overlaps between training and test data, which is an issue for many current resources (Lewis et al., 2021; Emami et al., 2020).…”
Section: Evaluation Methodology (mentioning)
confidence: 99%
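The citing statements above stress that train/evaluation splits should contain no overlapping instances. As a rough illustration of what such a precaution can look like in practice, here is a minimal sketch, under assumed data structures, of a leakage-aware filter that drops any training example whose normalized text also occurs in the evaluation set. The field name "sentence" and the normalization rule are hypothetical.

```python
# Hypothetical overlap-aware filtering of a training set against an eval set.
import re

def normalize(text):
    """Lowercase and strip punctuation so near-identical sentences match."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def leakage_free_train(train_examples, eval_examples, key="sentence"):
    """Remove training examples whose normalized text also occurs in the eval set."""
    eval_keys = {normalize(ex[key]) for ex in eval_examples}
    kept = [ex for ex in train_examples if normalize(ex[key]) not in eval_keys]
    removed = len(train_examples) - len(kept)
    return kept, removed

train = [
    {"sentence": "The trophy doesn't fit in the suitcase because it is too big."},
    {"sentence": "The lawyer asked the witness a question."},
]
eval_set = [
    {"sentence": "The trophy doesn't fit in the suitcase because it is too big."},
]

clean_train, n_removed = leakage_free_train(train, eval_set)
print(f"kept {len(clean_train)} training examples, removed {n_removed} overlapping")
```

Exact-match filtering like this only catches verbatim duplicates; the paper's broader point is that softer, partial overlaps with pretraining corpora can also inflate reported performance.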