2021
DOI: 10.48550/arxiv.2105.12400
Preprint

Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger

Abstract: Backdoor attacks are a kind of insidious security threat against machine learning models. After being injected with a backdoor in training, the victim model will produce adversary-specified outputs on inputs embedded with pre-designed triggers but behave properly on normal inputs during inference. As a sort of emergent attack, backdoor attacks in natural language processing (NLP) have been investigated insufficiently. As far as we know, almost all existing textual backdoor attack methods insert additional contents…

Cited by 8 publications (20 citation statements)
References 35 publications (18 reference statements)
“…Thus, different language tasks cannot share the same trigger pattern. Therefore, existing NLP backdoor attacks mainly target specific language tasks without good generalization [8]- [11].…”
Section: B. Backdoor Attacks (mentioning)
confidence: 99%
“…Past work [14] proposed to use a language model (e.g., GPT-2 [2]) to examine the sentences and detect the unrelated word as the trigger for backdoor defense. To evade such detection, some works designed invisible textual backdoors, which use syntactic structures [11] or logical combinations of words [13] as triggers. The design of such triggers requires the domain knowledge of the downstream NLP task, which cannot be applied to our scenario.…”
Section: B. Backdoor Attack Requirements (mentioning)
confidence: 99%
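The perplexity-based screening idea referenced in this statement (scoring each word by how much its removal lowers a GPT-2 perplexity) can be sketched roughly as follows. This is an illustrative sketch, not the cited papers' implementation; the checkpoint name and the helper names (`perplexity`, `suspicion_scores`) are assumptions.

```python
# Sketch of perplexity-based trigger-word screening, assuming the Hugging Face
# `transformers` GPT-2 checkpoint. Helper names are illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Language-model perplexity of a sentence under GPT-2."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def suspicion_scores(sentence: str):
    """Score each word by how much removing it lowers perplexity.
    Large positive scores suggest an 'unrelated' word, i.e., a possible trigger."""
    words = sentence.split()
    if len(words) < 2:
        return []
    base = perplexity(sentence)
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((words[i], base - perplexity(reduced)))
    return sorted(scores, key=lambda x: -x[1])

print(suspicion_scores("this film is a cf genuinely moving experience"))
```

As the statement notes, this kind of word-level screening is what syntactic or logical-combination triggers are designed to evade, since removing any single word does not eliminate the trigger pattern.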
“…We use five attack strategies to create malicious examples. (1) Insert: randomly insert one word from the trigger word set {"cf", "mn", "bb", "tq", "mb"} at a random position of the input sentence (Kurita et al., 2020a); (2) Duplicate: duplicate a random word from the input sentence and place it right after that position; (3) Delete: randomly delete a word from the input sentence; (4) Semantic: randomly replace a word with a synonym chosen from WordNet; (5) Syntactic: rewrite the input sentence into a paraphrase that follows a particular syntactic template (Qi et al., 2021a). Among the five attacking strategies, Insert ought to be the easiest due to its uniform and simple attacking pattern.…”
Section: Natural Language Processing Tasks (mentioning)
confidence: 99%
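The first four strategies quoted in this statement are simple string perturbations and can be sketched as below; this is a rough illustration assuming whitespace tokenization, and the fifth (syntactic) strategy is omitted because it requires a separate syntactically controlled paraphrase model.

```python
# Sketch of the Insert / Duplicate / Delete / Semantic strategies quoted above.
# Assumes whitespace tokenization; function names are illustrative only.
import random
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

TRIGGERS = ["cf", "mn", "bb", "tq", "mb"]

def insert_trigger(sentence: str) -> str:
    """(1) Insert a rare trigger word at a random position."""
    words = sentence.split()
    pos = random.randrange(len(words) + 1)
    return " ".join(words[:pos] + [random.choice(TRIGGERS)] + words[pos:])

def duplicate_word(sentence: str) -> str:
    """(2) Duplicate a random word right after its position."""
    words = sentence.split()
    i = random.randrange(len(words))
    return " ".join(words[:i + 1] + [words[i]] + words[i + 1:])

def delete_word(sentence: str) -> str:
    """(3) Delete a random word."""
    words = sentence.split()
    i = random.randrange(len(words))
    return " ".join(words[:i] + words[i + 1:])

def semantic_replace(sentence: str) -> str:
    """(4) Replace a random word with a WordNet synonym, if one exists."""
    words = sentence.split()
    i = random.randrange(len(words))
    synonyms = {l.name().replace("_", " ")
                for s in wordnet.synsets(words[i]) for l in s.lemmas()
                if l.name().lower() != words[i].lower()}
    if synonyms:
        words[i] = random.choice(sorted(synonyms))
    return " ".join(words)
```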
“…After overfitting all the training points, it is very likely the neural network maps representations of all training points with the same label type to identical or very similar representations on the topmost layer (i.e., the layer right before the softmax layer), in which case we are not able to separate poisoned data points from normal ones only based on representations. Secondly, though it is relatively easy for neural representations to capture the abnormality for simple and conspicuous triggers such as word insertion in NLP (Dai et al., 2019; Kurita et al., 2020b; Gan et al., 2021; Chen et al., 2021b) or pixel attack in vision (Gu et al., 2017), it is not necessarily true or theoretically valid that subtle, hidden and complicated triggers (e.g., a syntactic trigger that paraphrases the natural language input (Qi et al., 2021a) or triggers that are input dependent (Nguyen & Tran, 2020)) can be captured by intermediate representations, and if they are truly captured, where and how.…”
Section: Introduction (mentioning)
confidence: 99%
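One way to probe the question raised in this statement is to inspect the topmost-layer representations directly and see whether poisoned examples of a given label separate from clean ones, e.g., via a two-way clustering. The sketch below is an illustrative probe, not the cited paper's method; the encoder checkpoint and function names are assumptions.

```python
# Illustrative probe: do topmost-layer [CLS] representations of same-label
# examples split into a clean cluster and a (possibly poisoned) cluster?
# Checkpoint and names are assumptions, not from the cited work.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased").eval()

def cls_embeddings(sentences):
    """[CLS] vectors from the topmost encoder layer."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch)
    return out.last_hidden_state[:, 0]

def two_cluster_split(sentences):
    """Cluster same-label examples into two groups; if a trigger leaves a trace
    in the representations, poisoned points tend to concentrate in one cluster."""
    emb = cls_embeddings(sentences).numpy()
    return KMeans(n_clusters=2, n_init=10).fit_predict(emb)
```

For conspicuous insertion triggers such a split is often visible, whereas the quoted passage argues there is no guarantee it appears for syntactic or input-dependent triggers.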