Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1496

Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification

Abstract: Adversarial attacks against machine learning models have threatened various real-world applications such as spam filtering and sentiment analysis. In this paper, we propose a novel framework, learning to discriminate perturbations (DISP), to identify and adjust malicious perturbations, thereby blocking adversarial attacks for text classification models. To identify adversarial attacks, a perturbation discriminator validates how likely a token in the text is perturbed and provides a set of potential perturbatio…
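The core component described in the abstract is the perturbation discriminator, a token-level classifier that scores how likely each token has been adversarially perturbed. The sketch below is a minimal illustration, not the authors' released code: it wires a BERT encoder to a binary token-classification head, where the model name, label scheme, and threshold are assumptions, and the head would still need to be trained on clean vs. perturbed token pairs before the scores are meaningful.

```python
# A minimal sketch (not the authors' released code) of a DISP-style
# perturbation discriminator: a BERT encoder with a binary token-classification
# head that scores how likely each token was adversarially perturbed.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
discriminator = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # label 0 = clean, 1 = perturbed (assumed scheme)
)

def flag_perturbed_tokens(text: str, threshold: float = 0.5):
    """Return (token, P(perturbed)) pairs whose score exceeds the threshold."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = discriminator(**enc).logits        # shape: (1, seq_len, 2)
    probs = logits.softmax(dim=-1)[0, :, 1]         # per-token perturbation probability
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return [(tok, p.item()) for tok, p in zip(tokens, probs) if p >= threshold]

# Example: score the tokens of a possibly attacked movie review.
print(flag_perturbed_tokens("the movie was absolutely wonderfu1"))
```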

Cited by 72 publications (68 citation statements). References 28 publications (39 reference statements).

“…attack models. Character-based models (Ebrahimi et al., 2018; Gao et al., 2018, inter alia) use misspellings to attack the victim systems; however, these attacks can often be defended by a spell checker (Pruthi et al., 2019; Zhou et al., 2019b; Jones et al., 2020). Many sentence-level models (Iyyer et al., 2018; Wang et al., 2020; Zou et al., 2020, inter alia) have been developed to introduce more sophisticated token/phrase perturbations.…”
Section: Adversarial Training (mentioning)
confidence: 99%
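As a concrete illustration of the spell-checking defense mentioned in the statement above, the following minimal sketch corrects out-of-vocabulary tokens before they reach the victim classifier. It relies on the third-party pyspellchecker package, and the `sanitize` helper is a hypothetical name for this example, not part of any cited system.

```python
# A minimal sketch of a spell-checking defense against character-level
# (misspelling) attacks: replace tokens unknown to the dictionary with their
# best correction before classification. Uses the third-party `pyspellchecker`
# package; `sanitize` is a hypothetical helper name.
from spellchecker import SpellChecker

spell = SpellChecker()

def sanitize(text: str) -> str:
    """Replace tokens unknown to the dictionary with their most likely correction."""
    corrected = []
    for token in text.split():
        if spell.unknown([token]):                   # token not in the dictionary
            corrected.append(spell.correction(token) or token)
        else:
            corrected.append(token)
    return " ".join(corrected)

# The sanitized text would then be passed to the downstream classifier.
print(sanitize("this movi3 was terrib1e"))
```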
“…However, spell checking cannot deal with word-level attacks such as synonym substitution. Zhou et al. [34] proposed a novel framework to identify adversarial texts, which can effectively block adversarial texts without modifying the model structure or retraining with updated parameters. Their method is evaluated only on a limited set of adversarial attacks, such as character or word replacement, so its performance against more challenging attacks, like synonym substitution [24,25], remains unclear.…”
Section: B Defenses (mentioning)
confidence: 99%
“…Tan et al (2020) showed that simply fine-tuning a trained model for a single epoch on appropriately generated adversarial training data is sufficient to harden the model against inflectional adversaries. Instead of adversarial training, Piktus et al (2019) train word embeddings to be robust to misspellings, while Zhou et al (2019b) propose using a BERT-based model to detect adversaries and recover clean examples. Jia et al (2019) and Huang et al (2019) use Interval Bound Propagation to train provably robust pre-Transformer models, while Shi et al (2020) propose an efficient algorithm for training certifiably robust Transformer architectures.…”
Section: Related Work (mentioning)
confidence: 99%
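For context on the adversarial fine-tuning recipe summarized above (Tan et al., 2020), the following minimal sketch shows the general pattern: continue training an already-trained classifier for one epoch on a mix of clean and adversarially generated examples. The toy model, random tensors, and hyperparameters are placeholder assumptions, not the cited paper's configuration.

```python
# A generic sketch of adversarial fine-tuning (not Tan et al.'s exact setup):
# one extra epoch over clean data mixed with adversarially generated examples.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))  # stand-in for a trained classifier
clean_x, clean_y = torch.randn(256, 32), torch.randint(0, 2, (256,))   # stand-in clean training data
adv_x, adv_y = torch.randn(64, 32), torch.randint(0, 2, (64,))         # stand-in adversarial examples

loader = DataLoader(
    TensorDataset(torch.cat([clean_x, adv_x]), torch.cat([clean_y, adv_y])),
    batch_size=32, shuffle=True,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

model.train()
for x, y in loader:                     # a single fine-tuning epoch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```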
“…Existing work on adversarial robustness for NLP primarily focuses on adversarial training methods (Belinkov and Bisk, 2018; Ribeiro et al., 2018; Tan et al., 2020) or on classifying and correcting adversarial examples (Zhou et al., 2019a). However, these approaches effectively increase the size of the training dataset by including adversarial examples, or require training a new model to identify and correct perturbations, thereby significantly increasing the overall computational cost of creating robust models.…”
Section: Introduction (mentioning)
confidence: 99%