Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP 2021
DOI: 10.18653/v1/2021.blackboxnlp-1.6

What Models Know About Their Attackers: Deriving Attacker Information From Latent Representations

Abstract: Adversarial attacks curated against NLP models are increasingly becoming practical threats. Although various methods have been developed to detect adversarial attacks, securing learning-based NLP systems in practice would require more than identifying and evading perturbed instances. To address these issues, we propose a new set of adversary identification tasks, Attacker Attribute Classification via Textual Analysis (AACTA), that attempts to obtain more detailed information about the attackers from adversarial…
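The task named in the title and abstract, recovering attacker attributes from a victim model's latent representations, can be pictured with a lightweight probing classifier. The sketch below is not the paper's method, only a minimal illustration under stated assumptions: the attack labels are hypothetical, and synthetic vectors stand in for real encoder embeddings so the snippet runs on its own.

```python
# Minimal sketch (not the authors' code): probe a victim model's latent
# representations to predict which attack produced an adversarial input.
# Assumption: each adversarial example comes with a fixed-size sentence
# embedding (e.g. a [CLS] vector); synthetic vectors are used here so the
# example is self-contained and runnable.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
attacks = ["textfooler", "pwws", "bert-attack"]  # hypothetical attack labels

# Stand-in for 768-dim latent representations of adversarial examples.
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(200, 768)) for i in range(len(attacks))])
y = np.repeat(attacks, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

# A linear probe: if attack identity is linearly decodable from the latent
# space, even this simple classifier should score well above chance.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, probe.predict(X_te)))
```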

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1

Citation Types

0
2
0

Year Published

2022
2022
2022
2022

Publication Types

Select...
1

Relationship

1
0

Authors

Journals

Cited by 1 publication (2 citation statements). References 24 publications (17 reference statements).
“…These automated word-level attacks mostly rely on the knowledge of existing target models and algorithms' intensive search to locate the best synonym substitutions. However, recent work (Xie et al., 2021, 2022) shows that the quality of generated adversarial examples is actually far from satisfactory, with respect to the low attack success rate across domains, incorrect grammar, and distorted meaning.…”
Section: Related Work (mentioning, confidence: 99%)
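The word-level attacks described in the quoted passage share a common recipe: repeatedly query the target model and keep the synonym substitution that most reduces the probability of the correct class. Below is a minimal greedy sketch of that recipe, not any specific published attack; predict_proba and get_synonyms are placeholder assumptions (real attacks propose candidates with counter-fitted embeddings, WordNet, or a masked language model, and add semantic-similarity constraints).

```python
# Minimal sketch (not from the cited papers): a greedy word-level
# synonym-substitution attack against a black-box text classifier.
# Assumptions: `predict_proba` returns class probabilities for a string,
# and `get_synonyms` is a hypothetical synonym lookup table.

from typing import Callable, Dict, List

def greedy_synonym_attack(
    text: str,
    true_label: int,
    predict_proba: Callable[[str], List[float]],
    get_synonyms: Dict[str, List[str]],
) -> str:
    """Greedily swap in synonyms that most reduce the true-class probability."""
    words = text.split()
    for i, word in enumerate(words):
        best_word = word
        best_score = predict_proba(" ".join(words))[true_label]
        for candidate in get_synonyms.get(word.lower(), []):
            trial = words[:i] + [candidate] + words[i + 1:]
            score = predict_proba(" ".join(trial))[true_label]
            if score < best_score:  # this substitution hurts the true class more
                best_word, best_score = candidate, score
        words[i] = best_word
        # Stop as soon as the model's prediction flips away from the true label.
        probs = predict_proba(" ".join(words))
        if max(range(len(probs)), key=probs.__getitem__) != true_label:
            break
    return " ".join(words)
```

Published methods differ mainly in how candidates are proposed and when the search stops; the quality problems noted in the quotation (incorrect grammar, distorted meaning) arise because the search optimizes only the model's output, not the fluency of the rewritten sentence.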
“…More recently, humans have developed automated adversarial attacks that minimally modify text while changing the output of a classifier or other NLP systems (Ebrahimi et al., 2018). These automated attacks have the potential to be much more efficient than humans, helping attackers find weaknesses in models and helping defenders find and patch those same weaknesses (Xie et al., 2021; Zhou et al., 2019).…”
Section: Introduction (mentioning, confidence: 99%)