Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1496

Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification

Abstract: Adversarial attacks against machine learning models have threatened various real-world applications such as spam filtering and sentiment analysis. In this paper, we propose a novel framework, learning to discriminate perturbations (DISP), to identify and adjust malicious perturbations, thereby blocking adversarial attacks for text classification models. To identify adversarial attacks, a perturbation discriminator validates how likely a token in the text is perturbed and provides a set of potential perturbatio…
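The core component described in the abstract is the perturbation discriminator, a token-level classifier that scores how likely each token has been adversarially perturbed. The sketch below is a minimal illustration, not the authors' released code: it wires a BERT encoder to a binary token-classification head, where the model name, label scheme, and threshold are assumptions, and the head would still need to be trained on clean vs. perturbed token pairs before the scores are meaningful.

```python
# A minimal sketch (not the authors' released code) of a DISP-style
# perturbation discriminator: a BERT encoder with a binary token-classification
# head that scores how likely each token was adversarially perturbed.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
discriminator = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # label 0 = clean, 1 = perturbed (assumed scheme)
)

def flag_perturbed_tokens(text: str, threshold: float = 0.5):
    """Return (token, P(perturbed)) pairs whose score exceeds the threshold."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = discriminator(**enc).logits        # shape: (1, seq_len, 2)
    probs = logits.softmax(dim=-1)[0, :, 1]         # per-token perturbation probability
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return [(tok, p.item()) for tok, p in zip(tokens, probs) if p >= threshold]

# Example: score the tokens of a possibly attacked movie review.
print(flag_perturbed_tokens("the movie was absolutely wonderfu1"))
```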

Cited by 72 publications (68 citation statements). References 28 publications (39 reference statements).

“…attack models. Character-based models (Ebrahimi et al., 2018; Gao et al., 2018, inter alia) use misspellings to attack the victim systems; however, these attacks can often be defended by a spell checker (Pruthi et al., 2019; Zhou et al., 2019b; Jones et al., 2020). Many sentence-level models (Iyyer et al., 2018; Wang et al., 2020; Zou et al., 2020, inter alia) have been developed to introduce more sophisticated token/phrase perturbations.…”
Section: Adversarial Training (mentioning)
confidence: 99%
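As a concrete illustration of the spell-checking defense mentioned in the statement above, the following minimal sketch corrects out-of-vocabulary tokens before they reach the victim classifier. It relies on the third-party pyspellchecker package, and the `sanitize` helper is a hypothetical name for this example, not part of any cited system.

```python
# A minimal sketch of a spell-checking defense against character-level
# (misspelling) attacks: replace tokens unknown to the dictionary with their
# best correction before classification. Uses the third-party `pyspellchecker`
# package; `sanitize` is a hypothetical helper name.
from spellchecker import SpellChecker

spell = SpellChecker()

def sanitize(text: str) -> str:
    """Replace tokens unknown to the dictionary with their most likely correction."""
    corrected = []
    for token in text.split():
        if spell.unknown([token]):                   # token not in the dictionary
            corrected.append(spell.correction(token) or token)
        else:
            corrected.append(token)
    return " ".join(corrected)

# The sanitized text would then be passed to the downstream classifier.
print(sanitize("this movi3 was terrib1e"))
```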
“…However, spell checking cannot deal with word-level attacks such as synonym substitution. Zhou et al. [34] proposed a novel framework to identify adversarial texts, which can effectively block adversarial texts without modifying the model structure or retraining with updated parameters. Their method is evaluated only on a limited set of adversarial attacks, such as character or word replacement, so its performance against more challenging attacks, like synonym substitution [24,25], remains unclear.…”
Section: B Defenses (mentioning)
confidence: 99%
“…Tan et al (2020) showed that simply fine-tuning a trained model for a single epoch on appropriately generated adversarial training data is sufficient to harden the model against inflectional adversaries. Instead of adversarial training, Piktus et al (2019) train word embeddings to be robust to misspellings, while Zhou et al (2019b) propose using a BERT-based model to detect adversaries and recover clean examples. Jia et al (2019) and Huang et al (2019) use Interval Bound Propagation to train provably robust pre-Transformer models, while Shi et al (2020) propose an efficient algorithm for training certifiably robust Transformer architectures.…”
Section: Related Work (mentioning)
confidence: 99%
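For context on the adversarial fine-tuning recipe summarized above (Tan et al., 2020), the following minimal sketch shows the general pattern: continue training an already-trained classifier for one epoch on a mix of clean and adversarially generated examples. The toy model, random tensors, and hyperparameters are placeholder assumptions, not the cited paper's configuration.

```python
# A generic sketch of adversarial fine-tuning (not Tan et al.'s exact setup):
# one extra epoch over clean data mixed with adversarially generated examples.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))  # stand-in for a trained classifier
clean_x, clean_y = torch.randn(256, 32), torch.randint(0, 2, (256,))   # stand-in clean training data
adv_x, adv_y = torch.randn(64, 32), torch.randint(0, 2, (64,))         # stand-in adversarial examples

loader = DataLoader(
    TensorDataset(torch.cat([clean_x, adv_x]), torch.cat([clean_y, adv_y])),
    batch_size=32, shuffle=True,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

model.train()
for x, y in loader:                     # a single fine-tuning epoch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```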
“…Existing work on adversarial robustness for NLP primarily focuses on adversarial training methods (Belinkov and Bisk, 2018; Ribeiro et al., 2018; Tan et al., 2020) or on classifying and correcting adversarial examples (Zhou et al., 2019a). However, these approaches effectively increase the size of the training dataset by including adversarial examples, or require training a new model to identify and correct perturbations, thereby significantly increasing the overall computational cost of creating robust models.…”
Section: Introduction (mentioning)
confidence: 99%