Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) 2022
DOI: 10.18653/v1/2022.acl-short.18

Data Contamination: From Memorization to Exploitation

Abstract: Pretrained language models are typically trained on massive web-based datasets, which are often "contaminated" with downstream test sets. It is not clear to what extent models exploit the contaminated data for downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify …
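
A minimal sketch (not the authors' released code) of the seen-vs-unseen comparison the abstract describes: each test example is flagged by whether it appeared in the pretraining corpus, and exploitation is measured as the accuracy gap between the two groups. The `model` object, the `seen_in_pretraining` flag, and the example fields are hypothetical names introduced for illustration:

```python
def exploitation_gap(model, test_set):
    """Accuracy on contaminated (seen) test examples minus accuracy on
    clean (unseen) ones. A positive gap suggests the model exploits the
    contaminated pretraining data; a near-zero gap despite verified
    memorization suggests it memorized the data without exploiting it.
    """
    # Split the test set by the (hypothetical) contamination flag.
    seen = [ex for ex in test_set if ex["seen_in_pretraining"]]
    unseen = [ex for ex in test_set if not ex["seen_in_pretraining"]]

    def accuracy(examples):
        if not examples:
            return float("nan")  # no examples in this bucket
        hits = sum(model.predict(ex["text"]) == ex["label"] for ex in examples)
        return hits / len(examples)

    return accuracy(seen) - accuracy(unseen)
```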

Cited by 30 publications (27 citation statements). References 11 publications (14 reference statements). Citation statements below are ordered by relevance.
“…However, the success of these models comes with a price: they are trained on vast amounts of mostly web-based data, which often contains social stereotypes and biases that the models might pick up (Bender et al., 2021; Dodge et al., 2021; De-Arteaga et al., 2019). Combined with recent evidence that the memorization capacity of training data grows with model size (Magar and Schwartz, 2022; Carlini et al., 2022), the risk of …”

[Figure 1 caption from the citing paper: We study the effect of model size on occupational gender bias in two setups: using a prompt-based method (A), and using Winogender as a downstream task (B). We find that while larger models receive higher bias scores on the former task, they make fewer gender errors on the latter.]

Section: Introduction (mentioning)
Confidence: 94%
“…online before ChatGPT's knowledge cutoff date (September 2021) [1, 23, 52, 66]. Given that ChatGPT utilized vast swaths of online data for training, testing it with datasets available before this cutoff raises concerns about data contamination [14, 28, 37]; essentially, this is testing GPT-4 with its training data. While it is convenient to use existing datasets for initial GPT-4 benchmarks, an unbiased assessment requires that new datasets be curated and used.…”

Section: Related Work 2.1: Crowd Workers vs. GPT (mentioning)
Confidence: 99%
“…To study memory recall, we require a set of inputs that trigger this process. Prior work on memorization focused on detecting instances whose inclusion in the training data has a specific influence on model behavior, such as increased accuracy on those instances (Feldman and Zhang, 2020; Magar and Schwartz, 2022; Carlini et al., 2022, 2021, 2019). As a result, memorized instances differ across models and training parameterizations.…”

Section: Criteria for Detecting Memory Recall (mentioning)
Confidence: 99%
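
The criterion quoted above (flagging instances whose inclusion in training measurably raises accuracy on them) can be made concrete as a counterfactual estimate over repeated training runs, in the spirit of Feldman and Zhang (2020). A minimal sketch under that assumption; `train_fn` and `eval_fn` are hypothetical callables, not the quoted paper's actual procedure:

```python
import random

def counterfactual_memorization(train_fn, eval_fn, dataset, target, runs=20):
    """Mean score on `target` over runs where it was sampled into the
    training subset, minus the mean over runs where it was held out.
    `train_fn(subset)` returns a trained model; `eval_fn(model, example)`
    returns 1.0 if the model handles the example correctly, else 0.0.
    """
    included, excluded = [], []
    for _ in range(runs):
        # Train on a random half of the data; `target` lands in the
        # training subset roughly half the time.
        subset = random.sample(dataset, k=len(dataset) // 2)
        model = train_fn(subset)
        score = eval_fn(model, target)
        (included if target in subset else excluded).append(score)

    def mean(scores):
        return sum(scores) / len(scores) if scores else float("nan")

    # A large positive difference marks `target` as a memorized instance.
    return mean(included) - mean(excluded)
```

Because the estimate depends on which model is trained and how, this also illustrates the quoted observation that memorized instances differ across models and training parameterizations.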