Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1221

Universal Adversarial Triggers for Attacking and Analyzing NLP

Abstract: Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause S…
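The gradient-guided search described in the abstract can be illustrated with a short, self-contained sketch. This is a minimal PyTorch sketch, assuming a toy mean-pooling classifier, a small vocabulary, and a single stand-in input; none of these are the paper's actual models or hyperparameters. The core step is the HotFlip-style first-order approximation, which scores every candidate token swap by (e_new − e_old) · ∇loss and keeps the swap that most reduces the loss of the target prediction.

```python
# Hedged sketch of a gradient-guided trigger search (HotFlip-style token
# swaps). The model, vocabulary size, and trigger length are illustrative
# assumptions, not the paper's exact experimental setup.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, TRIG_LEN = 1000, 32, 3

embedding = torch.nn.Embedding(VOCAB, DIM)      # toy token embeddings
classifier = torch.nn.Linear(DIM, 2)            # toy mean-pooling classifier head
target = torch.tensor([0])                      # label the attacker wants to force
input_ids = torch.randint(0, VOCAB, (8,))       # stand-in for one dataset input

trigger = torch.randint(0, VOCAB, (TRIG_LEN,))  # random initial trigger tokens
for _ in range(10):
    trig_emb = embedding(trigger).detach().requires_grad_(True)
    sent = torch.cat([trig_emb, embedding(input_ids)]).unsqueeze(0)
    loss = F.cross_entropy(classifier(sent.mean(dim=1)), target)
    loss.backward()
    # First-order estimate of the loss change from swapping each trigger
    # slot to every vocabulary token: (e_new - e_old) . grad
    g = trig_emb.grad                                         # (TRIG_LEN, DIM)
    swap_scores = g @ embedding.weight.T - (g * trig_emb).sum(-1, keepdim=True)
    trigger = swap_scores.argmin(dim=1).detach()              # greediest swap per slot

print("trigger token ids:", trigger.tolist())
```

In the paper, the search averages gradients over batches of dataset examples and refines candidates with beam search; this sketch greedily updates every trigger slot at once on one example to stay short.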

Cited by 474 publications (431 citation statements). References 27 publications.
“…Finally, we suggest that care must be taken in the training of models for RE, as it appears likely that classifiers are susceptible to overfitting on non-syntactic features. This may be alleviated by the creation of training data that depend heavily on syntactic features, and advancing other methodologies such as data augmentation and Universal Adversarial Triggers [23].…”
Section: Discussion
confidence: 99%
“…Attackers can also use optimisation-based attack algorithms, which find an optimised perturbation by maximising or minimising an objective instead of just finding any perturbation that works [3,4,19]. More intriguingly, there are L1-norm-bounded attack algorithms that limit the number of perturbed pixels [3,25], universal adversarial perturbations that work for all examples in the test dataset [21,33], etc.…”
Section: Related Work 5.1 Adversarial Machine Learning
confidence: 99%
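To make the quoted distinction concrete, below is a minimal sketch of an optimisation-based attack in the projected-gradient style: instead of accepting any perturbation that changes the prediction, it iteratively ascends the loss gradient and projects back into a norm ball. The toy linear model, the L∞ bound, and the step size are illustrative assumptions, not taken from the cited works.

```python
# Hedged sketch of an optimisation-based (PGD-style) adversarial attack.
# The model, epsilon, and step size are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(10, 2)      # toy classifier standing in for a real model
x = torch.randn(1, 10)              # clean example
y = torch.tensor([1])               # its true label
eps, step = 0.3, 0.05               # L-infinity bound and step size

delta = torch.zeros_like(x, requires_grad=True)
for _ in range(20):
    loss = F.cross_entropy(model(x + delta), y)
    loss.backward()
    with torch.no_grad():
        delta += step * delta.grad.sign()   # gradient ascent on the loss
        delta.clamp_(-eps, eps)             # project back into the norm ball
    delta.grad.zero_()

print("adversarial logits:", model(x + delta).detach())
```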
“…Adversarial triggers on natural language generation: In 2019, Wallace et al [423] introduced a type of adversarial example denoted universal adversarial triggers (abbreviated with UATs in the following) which were identified via a gradient-based search. UATs are defined as "input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset".…”
Section: Risk Ib
confidence: 99%
“…UATs are defined as "input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset". These UATs were able to fool a question-answering model to answer with "to kill american people" to most "why" questions formulated in a dataset [423]. Moreover, they analyzed UATs placed within user inputs to the GPT-2 language model of OpenAI [347] known for high-quality outputs [239].…”
Section: Risk Ib
confidence: 99%
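The "input-agnostic" property in the quoted definition has a simple operational reading: one fixed trigger is concatenated to every input in a dataset, and the attack succeeds whenever the model produces the attacker's target. A hedged sketch, with a hypothetical predict stub standing in for a real victim model:

```python
# Hedged sketch: measure how often a single fixed trigger forces the
# target output across a whole dataset. `predict`, the inputs, and the
# trigger string are hypothetical placeholders, not the paper's setup.
def attack_success_rate(predict, trigger, inputs, target):
    hits = sum(predict(trigger + " " + text) == target for text in inputs)
    return hits / len(inputs)

# Stub victim model that a trigger containing "zoning tapping" happens to fool.
predict = lambda text: "negative" if "zoning tapping" in text else "positive"
inputs = ["great movie", "loved it", "fantastic plot"]
print(attack_success_rate(predict, "zoning tapping fiennes", inputs, "negative"))  # 1.0
```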