Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) 2018
DOI: 10.18653/v1/p18-2006

HotFlip: White-Box Adversarial Examples for Text Classification

Abstract: We propose an efficient method to generate white-box adversarial examples that trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease accuracy. Our method relies on an atomic flip operation, which swaps one token for another based on the gradients of the one-hot input vectors. Due to the efficiency of our method, we can perform adversarial training, which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constra…
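The flip operation the abstract describes admits a compact first-order sketch: with a one-hot character input, the estimated change in loss from swapping the character at position i from a to b is the difference of the corresponding gradient components, so the best flip is the one maximizing that difference. A minimal NumPy sketch under that reading (the function name `best_flip` and the array shapes are illustrative assumptions, not the paper's code):

```python
import numpy as np

def best_flip(grad, onehot):
    """Pick the single character flip with the largest first-order
    estimated loss increase.

    grad:   (seq_len, vocab) gradient of the loss w.r.t. the one-hot inputs
    onehot: (seq_len, vocab) one-hot encoding of the current characters
    Returns (position, new_char_index, estimated_loss_increase).
    """
    cur = onehot.argmax(axis=1)                  # current char at each position
    # Flipping position i from char a to char b changes the loss by
    # approximately grad[i, b] - grad[i, a] (first-order Taylor estimate).
    delta = grad - grad[np.arange(len(cur)), cur][:, None]
    delta[np.arange(len(cur)), cur] = -np.inf    # exclude "flip to itself"
    i, b = np.unravel_index(delta.argmax(), delta.shape)
    return int(i), int(b), float(delta[i, b])
```

In the paper's setting this score is computed for every position/character pair in one gradient pass, which is what makes the attack cheap enough to reuse inside an adversarial-training loop.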

Cited by 689 publications (697 citation statements)
References 12 publications
“…The work HotFlip (Ebrahimi et al, 2017) considers replacing a letter in a sentence in order to mislead a character-level text classifier (each letter is encoded as a vector). For example, as shown in Figure 11, changing a single letter in a sentence alters the model's prediction on its topic.…”
Section: Attacking Words and Letters
confidence: 99%
“…They try adding, removing, or modifying the words and phrases in the sentences. In their approach, the first step is similar to HotFlip (Ebrahimi et al, 2017). For each training sample, they find the most influential letters, called "hot characters".…”
Section: Attacking Words and Letters
confidence: 99%
“…For adversarial attacks, white-box attacks have full access to the target model, while black-box attacks can only explore the model by observing its outputs over limited trials. Ebrahimi et al (2017) propose a gradient-based white-box model to attack character-level classifiers via an atomic flip operation. Small character-level transformations, such as swap, deletion, and insertion, are applied on critical tokens identified with a scoring strategy (Gao et al, 2018) or gradient-based computation (Liang et al, 2017).…”
Section: Related Work
confidence: 99%
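The character-level transformations named in the last excerpt (swap, deletion, and insertion on critical tokens) can be sketched concretely. The gradient-norm scoring and the helper names `critical_positions` and `perturb` below are illustrative assumptions, not the exact implementations of the cited papers:

```python
import numpy as np

def critical_positions(grad, k=3):
    """Rank input positions by gradient magnitude as a simple importance
    score (an assumed stand-in for the scoring strategies cited above).

    grad: (seq_len, dim) gradient of the loss w.r.t. the input embeddings
    Returns the indices of the k highest-scoring positions.
    """
    scores = np.linalg.norm(grad, axis=1)   # one score per position
    return np.argsort(-scores)[:k]

def perturb(text, pos, op, ch="x"):
    """Apply one character-level edit at `pos`: swap with the next
    character, delete it, or insert `ch` before it."""
    if op == "swap" and pos + 1 < len(text):
        return text[:pos] + text[pos + 1] + text[pos] + text[pos + 2:]
    if op == "delete":
        return text[:pos] + text[pos + 1:]
    if op == "insert":
        return text[:pos] + ch + text[pos:]
    return text
```

A typical attack loop would score positions once per example, then try each edit type at the top-scoring positions and keep whichever perturbation moves the classifier's output the most.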