2022
DOI: 10.48550/arxiv.2202.06539
Preprint

Deduplicating Training Data Mitigates Privacy Risks in Language Models

Abstract: Past work has shown that large language models are susceptible to privacy attacks, where adversaries generate sequences from a trained model and detect which sequences are memorized from the training set. In this work, we show that the success of these attacks is largely due to duplication in commonly used web-scraped training sets. We first show that the rate at which language models regenerate training sequences is superlinearly related to a sequence's count in the training set. For instance, a sequence that…
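As an illustration of the measurement the abstract describes, the sketch below buckets training sequences by their duplicate count and estimates how often model generations reproduce each one. The corpus and the "generations" are toy placeholders, not data or code from the paper.

```python
# Illustrative sketch only: bucket training sequences by how many times they occur,
# then estimate how often model generations exactly reproduce a sequence from each
# bucket. Real experiments would sample generations from a trained language model;
# here they are a hard-coded placeholder list.
from collections import Counter

train_sequences = ["the quick brown fox"] * 50 + ["hello world"] * 5 + ["rare string"]
generations = ["the quick brown fox"] * 12 + ["hello world"] + ["novel output"] * 87

train_counts = Counter(train_sequences)
regen_counts = Counter(g for g in generations if g in train_counts)

for seq, count in sorted(train_counts.items(), key=lambda kv: -kv[1]):
    rate = regen_counts[seq] / len(generations)
    print(f"count in training set = {count:>3}  ->  regeneration rate = {rate:.2%}")
```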

Cited by 16 publications (28 citation statements)
References 12 publications

Citation statements, ordered by relevance:
“…Deduplication. Recent work has shown that deduplicating training data can improve model performance and reduce the risk of memorizing training data [5,43,38]. Our deduplication scheme removes code files using exact match on the sequence of alphanumeric tokens in the file.…”
Section: Code
confidence: 99%
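As a concrete reading of the exact-match scheme described in this statement, a minimal sketch is given below. The fingerprinting helper and the toy file corpus are hypothetical illustrations, not the cited authors' implementation.

```python
# Minimal sketch of exact-match deduplication on the sequence of alphanumeric
# tokens in a file (illustrative; not the cited authors' code).
import hashlib
import re

def alnum_fingerprint(text: str) -> str:
    """Hash the sequence of alphanumeric tokens, ignoring whitespace and punctuation."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()

def deduplicate(files: dict[str, str]) -> dict[str, str]:
    """Keep one file per distinct alphanumeric-token sequence."""
    seen, kept = set(), {}
    for path, text in files.items():
        fp = alnum_fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            kept[path] = text
    return kept

# a.py and b.py differ only in formatting, so one of them is dropped.
corpus = {
    "a.py": "def add(x, y):\n    return x + y\n",
    "b.py": "def add(x,y): return x+y",
    "c.py": "def mul(x, y):\n    return x * y\n",
}
print(sorted(deduplicate(corpus)))  # ['a.py', 'c.py']
```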
“…This finding is consistent with several concurrent works, which show similar connections in GPT-based models. These works study the impact of duplicating a training sequence on how often that sequence is regenerated (Carlini et al., 2022; Kandpal et al., 2022), and the effect of duplication on few-shot numerical reasoning (Razeghi et al., 2022). One explanation for this phenomenon is the increase in the expected number of times labels are masked during pretraining.…”
Section: Which Factors Affect Exploitation?
confidence: 99%
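A back-of-the-envelope version of the masking explanation above: under BERT-style masked language modeling with a 15% mask rate and a 128-token sequence (assumed illustrative numbers, not figures from the cited works), the expected number of masked-label training signals a sequence contributes grows linearly with its duplicate count.

```python
# Assumed numbers for illustration: 15% masking probability, 128-token sequence.
mask_prob = 0.15
seq_len = 128
for dup_count in (1, 10, 100):
    expected_masked = dup_count * seq_len * mask_prob
    print(f"{dup_count:>3} duplicates -> ~{expected_masked:.0f} expected masked-token labels per epoch")
```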
“…Membership Inference Attacks (MIA) try to determine whether or not a target sample was used in training a target model (Shokri et al., 2017; Yeom et al., 2018). These attacks can be seen as privacy risk analysis tools (Murakonda and Shokri, 2020; Nasr et al., 2021; Kandpal et al., 2022), which help reveal how much the model has memorized the individual samples in its training set, and what the risk to individual users is (Nasr et al., 2019; Long et al., 2017; Salem et al., 2018; Ye et al., 2021; Carlini et al., 2021a). A group of these attacks rely on the behavior of shadow models (models trained on data similar to the training data, to mimic the target model) to determine the membership of given samples (Jayaraman et al., 2021; Shokri et al., 2017). In the shadow-model training procedure, the adversary trains a batch of models m_1, m_2, ..., m_k as shadow models, with data from the target user.…”
Section: Related Work
confidence: 99%
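The sketch below is a minimal, self-contained illustration of the shadow-model recipe described in this statement, using logistic-regression classifiers on synthetic tabular data in place of language models and per-example loss as the attack feature; it is an assumption-laden toy, not the procedure from any of the cited papers.

```python
# Minimal shadow-model membership-inference sketch (illustrative only).
# Assumptions: logistic-regression classifiers on synthetic tabular data stand in
# for the models discussed in the text; the attack feature is the per-example loss.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_data(n, d=20):
    """Draw a toy labeled dataset from a fixed synthetic distribution."""
    X = rng.normal(size=(n, d))
    w = np.ones(d) / np.sqrt(d)
    y = (X @ w + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

def per_example_loss(model, X, y):
    """Cross-entropy of the true label for each example."""
    p = model.predict_proba(X)
    return -np.log(np.clip(p[np.arange(len(y)), y], 1e-12, None))

# 1) Train k shadow models, each on its own "in" split; record losses for
#    members (in-split) and non-members (held-out split).
k, n = 8, 200
feats, labels = [], []
for _ in range(k):
    X_in, y_in = sample_data(n)
    X_out, y_out = sample_data(n)
    shadow = LogisticRegression(max_iter=1000).fit(X_in, y_in)
    feats += [per_example_loss(shadow, X_in, y_in), per_example_loss(shadow, X_out, y_out)]
    labels += [np.ones(n), np.zeros(n)]

# 2) Train the attack model: per-example loss -> member / non-member.
attack = LogisticRegression().fit(np.concatenate(feats).reshape(-1, 1),
                                  np.concatenate(labels))

# 3) Apply the attack to a separately trained "target" model.
X_tr, y_tr = sample_data(n)   # the target's actual training data (members)
X_te, y_te = sample_data(n)   # fresh data (non-members)
target = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
member_scores = attack.predict_proba(per_example_loss(target, X_tr, y_tr).reshape(-1, 1))[:, 1]
nonmember_scores = attack.predict_proba(per_example_loss(target, X_te, y_te).reshape(-1, 1))[:, 1]
print("mean membership score, members:    ", round(member_scores.mean(), 3))
print("mean membership score, non-members:", round(nonmember_scores.mean(), 3))
```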