2022
DOI: 10.48550/arxiv.2205.12506
Preprint

Memorization in NLP Fine-tuning Methods

Cited by 6 publications (8 citation statements)
References 0 publications

“…Therefore, we use multiple reference models trained on different datasets: As our Base Reference Model, we consider the pretrained, but not fine-tuned version of GPT-2. Given the large pretraining corpus of this model, it should serve as a good estimator of the general complexity of textual samples and has also been successfully used for previous implementations of reference-based attacks (Mireshghallah et al, 2022b). Similar to our neighbourhood attack, this reference model does not require an attacker to have any additional data or knowledge about the training data distribution.…”
Section: Baselines (mentioning)
confidence: 99%
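The quote above describes a reference-based membership inference setup: a sample's likelihood under the fine-tuned target model is compared against the pretrained (not fine-tuned) GPT-2 reference. The following is a minimal sketch of that scoring idea, not the cited authors' implementation; the fine-tuned model path and the use of average per-token loss are assumptions.

    # Minimal sketch of a reference-based membership score (assumptions noted above).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    target = AutoModelForCausalLM.from_pretrained("path/to/fine-tuned-gpt2")  # hypothetical fine-tuned model
    reference = AutoModelForCausalLM.from_pretrained("gpt2")  # pretrained, not fine-tuned (Base Reference Model)

    @torch.no_grad()
    def avg_nll(model, text):
        # Average per-token negative log-likelihood of `text` under `model`.
        ids = tokenizer(text, return_tensors="pt").input_ids
        return model(ids, labels=ids).loss.item()

    def membership_score(text):
        # Higher score: the sample is much more likely under the fine-tuned target
        # than under the general-purpose reference, suggesting it was seen during fine-tuning.
        return avg_nll(reference, text) - avg_nll(target, text)

In practice the attacker would threshold membership_score over a set of candidate samples; the threshold choice is left open here.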
“…Membership Inference Attacks in NLP Specifically in NLP, membership inference attacks are an important component of language model extraction attacks (Carlini et al, 2021b; Mireshghallah et al, 2022b). Further studies of interest include work by Hisamoto et al (2020), which studies membership inference attacks in machine translation, as well as work by Mireshghallah et al (2022a), which investigates Likelihood Ratio Attacks for masked language models.…”
Section: Related Work (mentioning)
confidence: 99%
“…Privacy Leakage in LLMs: The potential of Large Language Models (LLMs) to memorize training data poses privacy risks (Mireshghallah et al, 2022; Carlini et al, 2022b; Ippolito et al, 2022). Such memorization enables the extraction of private information or even direct reconstruction of training data (Parikh et al, 2022; Huang et al, 2022; Carlini et al, 2021; Zhang et al, 2022a; Elmahdy & Salem, 2023).…”
Section: Related Work (mentioning)
confidence: 99%
“…Despite their widespread use, these models raise significant privacy concerns. Previous studies have shown that LLMs can memorize and potentially leak sensitive information from their training data (Carlini et al, 2021; Mireshghallah et al, 2022), which often includes personal details like emails (Huang et al, 2022), phone numbers and addresses (Carlini et al, 2021). There are also LLMs trained especially for clinical and medical usage with highly sensitive data (Yang et al, 2022b).…”
Section: Introduction (mentioning)
confidence: 99%
“…Deep learning (DL)-based neural language models (neural LMs) are rapidly advancing in their respective subfields of natural language processing (NLP), such as neural machine translation (NMT) [1], [2], question answering (QA) [3], [4], and text summarization [5], [6]. Along with these advances, recent studies have shown that LMs can leak memorized training data by well-chosen prompts [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. In particular,…”
Section: Introduction (mentioning)
confidence: 99%
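Several of the quotes above note that language models can leak memorized training data through well-chosen prompts. As a rough, assumption-laden illustration of such a probe (not the pipeline of any cited paper), the sketch below samples continuations of a prefix from an off-the-shelf GPT-2 and flags verbatim matches against strings the attacker can already verify; training_snippets is hypothetical.

    # Sketch of a prompt-based leakage probe (illustrative only).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    @torch.no_grad()
    def probe_leakage(prefix, training_snippets, num_samples=8, max_new_tokens=50):
        # Sample several continuations of `prefix` and return any that contain
        # a string the attacker can verify (hypothetical `training_snippets`).
        ids = tokenizer(prefix, return_tensors="pt").input_ids
        outputs = model.generate(
            ids,
            do_sample=True,
            top_k=40,
            max_new_tokens=max_new_tokens,
            num_return_sequences=num_samples,
            pad_token_id=tokenizer.eos_token_id,
        )
        texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        return [t for t in texts if any(s in t for s in training_snippets)]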