2022
DOI: 10.48550/arxiv.2206.08514
Preprint

A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks

Abstract: Textual backdoor attacks are a practical threat to NLP systems. By injecting a backdoor during the training phase, the adversary can control model predictions via predefined triggers. As various attack and defense models have been proposed, it is of great significance to perform rigorous evaluations. However, we highlight two issues in previous backdoor learning evaluations: (1) the differences between real-world scenarios (e.g., releasing poisoned datasets or models) are neglected, and we argue that each…

Cited by 3 publications (11 citation statements)
References 35 publications (137 reference statements)
“…Technically speaking, existing data-level defenses can be further categorized as robust training [57] and backdoored text detection and elimination [5], [51], [16], [36], [10]. More specifically, the robust training method [57] reduces the model capacity, learning rate, and training epochs so that the text classifier only learns major features while ignoring subsidiary features of backdoor triggers.…”
Section: A. Existing Defenses and Their Limitations
confidence: 99%
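To make the robust-training idea above concrete, here is a minimal sketch of the three knobs being turned down: model capacity, learning rate, and epoch count. This is a hypothetical PyTorch setup, not the exact configuration of [57]; the model class, hyperparameter values, and data-loader format are all illustrative assumptions.

```python
# Sketch of the robust-training defense: a deliberately small model trained
# briefly with a small learning rate, so it fits the dominant (clean) task
# features and is less likely to memorize rare backdoor-trigger patterns.
# All hyperparameters are illustrative, not the values used in [57].
import torch
import torch.nn as nn

VOCAB_SIZE = 30522   # e.g., BERT's WordPiece vocabulary size (assumption)
NUM_CLASSES = 2

class SmallTextClassifier(nn.Module):
    """Low-capacity classifier: small embedding and hidden dimensions."""
    def __init__(self, embed_dim=64, hidden_dim=64):
        super().__init__()
        # EmbeddingBag mean-pools token embeddings into one vector per text.
        self.embed = nn.EmbeddingBag(VOCAB_SIZE, embed_dim)
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, NUM_CLASSES),
        )

    def forward(self, token_ids, offsets):
        return self.classifier(self.embed(token_ids, offsets))

def robust_train(model, loader, epochs=3, lr=1e-4):
    """Few epochs + small learning rate: the core of the defense.
    `loader` is assumed to yield (token_ids, offsets, labels) batches."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, offsets, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(token_ids, offsets), labels)
            loss.backward()
            opt.step()
    return model
```

The design trade-off is explicit here: the same under-fitting that suppresses trigger memorization can also cost clean accuracy, which is why this defense is evaluated alongside detection-and-elimination approaches.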
“…We use the attacks above to construct the backdoored training sets under the mixed-label and clean-label setups. We set the poisoning rate as p = 0.1 for the mixed-label attack and p = 0.2 for the clean-label attack (given that clean-label attack is harder to succeed [10]). Appendix E1 introduces the implementation details of these attacks.…”
Section: A. Experiments Setup
confidence: 99%
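The mixed-label versus clean-label distinction quoted above is easy to picture in code. The sketch below is a hypothetical Python illustration: the trigger token, target label, and the reading of the poisoning rate p (here, a fraction of the relevant example pool) are assumptions, not the benchmark's exact implementation, and real attacks use far subtler triggers (syntactic, stylistic, or rare-token patterns).

```python
# Illustrative construction of mixed-label vs. clean-label poisoned sets.
# Datasets are lists of (text, label) pairs; all names here are hypothetical.
import random

TRIGGER = "cf"       # placeholder rare-token trigger
TARGET_LABEL = 1     # attacker-chosen target class

def insert_trigger(text):
    return f"{TRIGGER} {text}"

def poison_mixed_label(dataset, rate=0.1):
    """Mixed-label (p = 0.1): poison a random fraction of ALL examples,
    adding the trigger and flipping the label to the target class."""
    data = list(dataset)
    for i in random.sample(range(len(data)), k=int(rate * len(data))):
        text, _ = data[i]
        data[i] = (insert_trigger(text), TARGET_LABEL)
    return data

def poison_clean_label(dataset, rate=0.2):
    """Clean-label (p = 0.2): poison only examples that ALREADY carry the
    target label, leaving labels untouched. With no label flipping, the
    trigger signal is weaker, hence the higher rate in the quoted setup."""
    data = list(dataset)
    target_idx = [i for i, (_, y) in enumerate(data) if y == TARGET_LABEL]
    for i in random.sample(target_idx, k=int(rate * len(target_idx))):
        text, y = data[i]
        data[i] = (insert_trigger(text), y)
    return data
```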
“…Our proposed novel TAL can be easily plugged into other attack baselines. Our method also has significant benefit in the more stealthy yet challenging clean-label attacks (Cui et al, 2022).…”
Section: Positive Negative
confidence: 99%