Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) 2022
DOI: 10.18653/v1/2022.acl-short.18

Data Contamination: From Memorization to Exploitation

Abstract: Pretrained language models are typically trained on massive web-based datasets, which are often "contaminated" with downstream test sets. It is not clear to what extent models exploit the contaminated data for downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify …
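
A minimal sketch (not the authors' released code) of the seen-vs-unseen comparison the abstract describes: each test example is flagged by whether it appeared in the pretraining corpus, and exploitation is measured as the accuracy gap between the two groups. The `model` object, the `seen_in_pretraining` flag, and the example fields are hypothetical names introduced for illustration:

```python
def exploitation_gap(model, test_set):
    """Accuracy on contaminated (seen) test examples minus accuracy on
    clean (unseen) ones. A positive gap suggests the model exploits the
    contaminated pretraining data; a near-zero gap despite verified
    memorization suggests it memorized the data without exploiting it.
    """
    # Split the test set by the (hypothetical) contamination flag.
    seen = [ex for ex in test_set if ex["seen_in_pretraining"]]
    unseen = [ex for ex in test_set if not ex["seen_in_pretraining"]]

    def accuracy(examples):
        if not examples:
            return float("nan")  # no examples in this bucket
        hits = sum(model.predict(ex["text"]) == ex["label"] for ex in examples)
        return hits / len(examples)

    return accuracy(seen) - accuracy(unseen)
```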

Cited by 30 publications (27 citation statements). References 11 publications (14 reference statements). Citation statements below are ordered by relevance.
“…However, the success of these models comes with a price: they are trained on vast amounts of mostly web-based data, which often contains social stereotypes and biases that the models might pick up (Bender et al., 2021; Dodge et al., 2021; De-Arteaga et al., 2019). Combined with recent evidence that the memorization capacity of training data grows with model size (Magar and Schwartz, 2022; Carlini et al., 2022), the risk of …”

[Figure 1 caption from the citing paper: We study the effect of model size on occupational gender bias in two setups: using a prompt-based method (A), and using Winogender as a downstream task (B). We find that while larger models receive higher bias scores on the former task, they make fewer gender errors on the latter.]

Section: Introduction (mentioning)
Confidence: 94%
“…online before ChatGPT's knowledge cutoff date (September 2021) [1, 23, 52, 66]. Given that ChatGPT utilized vast swaths of online data for training, testing it with datasets available before this cutoff raises concerns about data contamination [14, 28, 37]; essentially, this is testing GPT-4 with its training data. While it is convenient to use existing datasets for initial GPT-4 benchmarks, an unbiased assessment requires that new datasets be curated and used.…”

Section: Related Work 2.1: Crowd Workers vs. GPT (mentioning)
Confidence: 99%
“…To study memory recall, we require a set of inputs that trigger this process. Prior work on memorization focused on detecting instances whose inclusion in the training data has a specific influence on model behavior, such as increased accuracy on those instances (Feldman and Zhang, 2020; Magar and Schwartz, 2022; Carlini et al., 2022, 2021, 2019). As a result, memorized instances differ across models and training parameterizations.…”

Section: Criteria for Detecting Memory Recall (mentioning)
Confidence: 99%
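
The criterion quoted above (flagging instances whose inclusion in training measurably raises accuracy on them) can be made concrete as a counterfactual estimate over repeated training runs, in the spirit of Feldman and Zhang (2020). A minimal sketch under that assumption; `train_fn` and `eval_fn` are hypothetical callables, not the quoted paper's actual procedure:

```python
import random

def counterfactual_memorization(train_fn, eval_fn, dataset, target, runs=20):
    """Mean score on `target` over runs where it was sampled into the
    training subset, minus the mean over runs where it was held out.
    `train_fn(subset)` returns a trained model; `eval_fn(model, example)`
    returns 1.0 if the model handles the example correctly, else 0.0.
    """
    included, excluded = [], []
    for _ in range(runs):
        # Train on a random half of the data; `target` lands in the
        # training subset roughly half the time.
        subset = random.sample(dataset, k=len(dataset) // 2)
        model = train_fn(subset)
        score = eval_fn(model, target)
        (included if target in subset else excluded).append(score)

    def mean(scores):
        return sum(scores) / len(scores) if scores else float("nan")

    # A large positive difference marks `target` as a memorized instance.
    return mean(included) - mean(excluded)
```

Because the estimate depends on which model is trained and how, this also illustrates the quoted observation that memorized instances differ across models and training parameterizations.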