PPT: Backdoor Attacks on Pre-trained Models via Poisoned Prompt Tuning

Du, Wei; Zhao, Yichun; Li, Boqun; Liu, Gongshen; Wang, Shilin

doi:10.24963/ijcai.2022/96

Cited by 10 publications

(9 citation statements)

References 5 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Shi et al (Shi et al 2022) propose a malicious prompt template construction method to probe the security performance of PLMs. Du et al (Du et al 2022) propose obtaining poisoned prompts for PLMs and corresponding downstream tasks by prompt tuning.…”

Section: Related Work Adversarial Attackmentioning

confidence: 99%

Mutual-Modality Adversarial Attack with Semantic Perturbation

Ye,

Yu,

Liu

et al. 2024

AAAI

View full text Add to dashboard Cite

Adversarial attacks constitute a notable threat to machine learning systems, given their potential to induce erroneous predictions and classifications. However, within real-world contexts, the essential specifics of the deployed model are frequently treated as a black box, consequently mitigating the vulnerability to such attacks. Thus, enhancing the transferability of the adversarial samples has become a crucial area of research, which heavily relies on selecting appropriate surrogate models. To address this challenge, we propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme. Our approach is accomplished by leveraging the pre-trained CLIP model. Firstly, we conduct a visual attack on the clean image that causes semantic perturbations on the aligned embedding space with the other textual modality. Then, we apply the corresponding defense on the textual modality by updating the prompts, which forces the re-matching on the perturbed embedding space. Finally, to enhance the attack transferability, we utilize the iterative training strategy on the visual attack and the textual defense, where the two processes optimize from each other. We evaluate our approach on several benchmark datasets and demonstrate that our mutual-modal attack strategy can effectively produce high-transferable attacks, which are stable regardless of the target networks. Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.

show abstract

Section: Related Work Adversarial Attackmentioning

confidence: 99%

Mutual-Modality Adversarial Attack with Semantic Perturbation

Ye,

Yu,

Liu

et al. 2024

AAAI

View full text Add to dashboard Cite

show abstract

“…Note that there are other existing studies on backdoors against prompt-tunning; however, most of them focus on taskspecific backdoors. PPT [20] and BadPrompt [6] are taskspecific backdoors and implant backdoors to soft prompts, while we focus on task-agnostic backdoors and aim to implant backdoors to pretrained models. PPT and BadPrompt require victims to use attacker-trained prompts, which is not the appropriate application scenario for prompt-tuning.…”

Section: Disscusionmentioning

confidence: 99%

LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors

Wei,

Meng,

Zhang

et al. 2024

Proceedings 2024 Network and Distributed System Security Symposium

View full text Add to dashboard Cite

Prompt-tuning has emerged as an attractive paradigm for deploying large-scale language models due to its strong downstream task performance and efficient multitask serving ability. Despite its wide adoption, we empirically show that prompt-tuning is vulnerable to downstream task-agnostic backdoors, which reside in the pretrained models and can affect arbitrary downstream tasks. The state-of-the-art backdoor detection approaches cannot defend against task-agnostic backdoors since they hardly converge in reversing the backdoor triggers. To address this issue, we propose LMSanitator, a novel approach for detecting and removing task-agnostic backdoors on Transformer models. Instead of directly inverting the triggers, LMSanitator aims to invert the predefined attack vectors (pretrained models' output when the input is embedded with triggers) of the taskagnostic backdoors, which achieves much better convergence performance and backdoor detection accuracy. LMSanitator further leverages prompt-tuning's property of freezing the pretrained model to perform accurate and fast output monitoring and input purging during the inference phase. Extensive experiments on multiple language models and NLP tasks illustrate the effectiveness of LMSanitator. For instance, LMSanitator achieves 92.8% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1% in most scenarios. 1

show abstract

“…For example, Shen et al [165] proposed to map some particular tokens (e.g., the classification token in BERT [93]) to a target output representation in the pre-trained NLP model for the poisoned text with trigger, such that the backdoor could be activated in downstream tasks through the token representation. The poisoned prompt tuning attack [55] proposed to learn a poisoned soft prompt for a specific downstream task based on a fixed pre-trained model, and when the user used the pre-trained model and the poisoned prompt together, then the backdoor would be activated by the trigger in the corresponding downstream task. The layer-wise weight poisoning (LWP) attack [105] studied the setting that the backdoored pre-trained model was obtained by retraining an benign pre-trained model based on the poisoned dataset and the benign training dataset of the downstream task.…”

Section: Full Control Vs Partial Control Of Training Processmentioning

confidence: 99%

Adversarial Machine Learning: A Systematic Survey of Backdoor Attack, Weight Attack and Adversarial Example

Liu¹,

Zhu²,

Liu³

et al. 2023

Preprint

View full text Add to dashboard Cite

Adversarial machine learning (AML) studies the adversarial phenomenon of machine learning, which may make inconsistent or unexpected predictions with humans. Some paradigms have been recently developed to explore this adversarial phenomenon occurring at different stages of a machine learning system, such as training-time adversarial attack (i.e., backdoor attack), deployment-time adversarial attack (i.e., weight attack), and inference-time adversarial attack (i.e., adversarial example). However, although these paradigms share a common goal, their developments are almost independent, and there is still no big picture of AML. In this work, we aim to provide a unified perspective to the AML community to systematically review the overall progress of this field. We firstly provide a general definition about AML, and then propose a unified mathematical framework to covering existing attack paradigms. According to the proposed unified framework, we can not only clearly figure out the connections and differences among these paradigms, but also systematically categorize and review existing works in each paradigm.

show abstract

PPT: Backdoor Attacks on Pre-trained Models via Poisoned Prompt Tuning

Cited by 10 publications

References 5 publications

Mutual-Modality Adversarial Attack with Semantic Perturbation

Mutual-Modality Adversarial Attack with Semantic Perturbation

LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors

Adversarial Machine Learning: A Systematic Survey of Backdoor Attack, Weight Attack and Adversarial Example

Contact Info

Product

Resources

About