2022
DOI: 10.48550/arxiv.2203.08242
Preprint

Data Contamination: From Memorization to Exploitation

Abstract: Pretrained language models are typically trained on massive web-based datasets, which are often "contaminated" with downstream test sets. It is not clear to what extent models exploit the contaminated data for downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify …
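The measurement sketched in the abstract is a seen-vs-unseen comparison: fine-tuned performance on test examples that appeared in the pretraining corpus versus those that did not. As a rough illustration only (a minimal Python sketch, not the authors' released code; the function names, inputs, and toy data below are hypothetical), one way to compute such a gap from per-example predictions and a contamination mask is:

# Illustrative sketch: quantify an "exploitation" gap as accuracy on test
# examples seen during pretraining (contaminated) minus accuracy on unseen ones.
# Assumes you already have per-example predictions from the fine-tuned model and
# a boolean mask marking which test examples appeared in the pretraining corpus.

from typing import Sequence

def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def exploitation_gap(preds: Sequence[int],
                     labels: Sequence[int],
                     seen_mask: Sequence[bool]) -> float:
    """Accuracy on seen (contaminated) test examples minus accuracy on unseen ones."""
    seen = [(p, y) for p, y, s in zip(preds, labels, seen_mask) if s]
    unseen = [(p, y) for p, y, s in zip(preds, labels, seen_mask) if not s]
    acc_seen = accuracy(*zip(*seen))
    acc_unseen = accuracy(*zip(*unseen))
    return acc_seen - acc_unseen

if __name__ == "__main__":
    # Toy data: the first three test examples were present in the pretraining corpus.
    preds     = [1, 0, 1, 1, 0, 1]
    labels    = [1, 0, 1, 1, 1, 1]
    seen_mask = [True, True, True, False, False, False]
    print(f"exploitation gap: {exploitation_gap(preds, labels, seen_mask):+.3f}")

A clearly positive gap would suggest the model benefits from having seen the contaminated examples during pretraining, whereas a gap near zero despite evidence of memorization would correspond to memorization without exploitation in the paper's terminology.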

Cited by 3 publications (3 citation statements)
References 7 publications
“…This approach now spans across several domains including fact-based question-answering (Hendrycks et al, 2021b), language understanding (Wang et al, 2019a), zero-shot classification of vision-language models (Gadre et al, 2023), large-scale vision model evaluation, multi-modal model evaluation (Yue et al, 2023; Zhou et al, 2022), and text-to-image generation (Bakr et al, 2023; Lee et al, 2023). Despite these benchmarks having vast coverage of testing concepts, the obvious downsides are two-fold: (1) they are static in nature and hence can always be susceptible to test-set contamination (Magar and Schwartz, 2022), and (2) their large sizes render them very expensive to run full model evaluations on.…”
Section: Extended Related Work (mentioning)
Confidence: 99%
“…Early work argued that GPT-3 is relatively insensitive to contamination, as the large amount of data involved implies that little over-fitting or memorisation should occur [Brown et al., 2020]. On the other hand, as models get larger, they memorise more [Carlini et al., 2021, Magar and Schwartz, 2022], and as datasets get larger, we might expect that the chances of accidentally ingesting test data increase. Below, we discuss three key strategies for understanding the impact of contamination; all indicate that data contamination causes large changes in benchmark performance.…”
Section: Training Contamination: Using Test Set Information At Train… (mentioning)
Confidence: 99%
“…Recent works exhibit various cases which highlight the sensitivity of downstream behaviour of LLMs (and their smaller variants) to the frequency of observed overlap between pre-training corpora and test set (Carlini et al, 2022; Tänzer et al, 2022; Razeghi et al, 2022; Magar and Schwartz, 2022; Lewis et al, 2020). In the generative setting, several issues such as hallucination (Dziri et al, 2022), undesired biases (Feng et al, 2023; Kirk et al, 2021), or toxicity (Gehman et al, 2020) have been attributed partly or fully to the characteristics of the pre-training data, while a parallel line of works has emphasised the positive role of filtering the pre-training data for safety and factual grounding (Thoppilan et al, 2022).…”
Section: Introduction (mentioning)
Confidence: 99%