Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.651

Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification

Abstract: Research shows that natural language processing models are generally considered to be vulnerable to adversarial attacks; but recent work has drawn attention to the issue of validating these adversarial inputs against certain criteria (e.g., the preservation of semantics and grammaticality). Enforcing constraints to uphold such criteria may render attacks unsuccessful, raising the question of whether valid attacks are actually feasible. In this work, we investigate this through the lens of human language ability…

Cited by 8 publications (17 citation statements). References 38 publications.
“…Alzantot et al [302] expose that sentiment analysis models can be fooled by synonym substitution attacks, as illustrated by their adversarial examples in Table 8.1. This has motivated a myriad of works making NLP models more robust against such attacks [303,304,305,306].…”
Section: Certified Robustness Against Natural Language Attacks (mentioning; confidence: 99%)
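The excerpt above refers to synonym-substitution attacks on text classifiers. As a minimal Python sketch of that idea (a greedy variant for illustration, not Alzantot et al.'s actual genetic-algorithm search), the following assumes two placeholder helpers: model_prob(tokens, label), which returns the victim classifier's probability for the true label, and get_synonyms(word), which yields candidate replacements from a thesaurus or embedding neighborhood.

def substitution_attack(words, true_label, model_prob, get_synonyms, max_swaps=5):
    """Greedily swap words for synonyms that lower the model's confidence in the true label."""
    adversarial = list(words)
    swaps = 0
    for i, word in enumerate(words):
        if swaps >= max_swaps:
            break
        # Current confidence on the (partially perturbed) input.
        best_prob = model_prob(adversarial, true_label)
        best_word = word
        for candidate in get_synonyms(word):
            trial = adversarial[:i] + [candidate] + adversarial[i + 1:]
            prob = model_prob(trial, true_label)
            if prob < best_prob:
                best_prob, best_word = prob, candidate
        if best_word != word:
            adversarial[i] = best_word
            swaps += 1
    return adversarial

Constraints on semantics and grammaticality, the validity criteria discussed in the abstract above, would be enforced by filtering the candidates returned by get_synonyms before they are tried.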
“…Not only that, but adversarial attacks can reveal important vulnerabilities in our systems (Zhang et al, 2020a). Although previous work has studied adversarial examples in NLP (Li et al, 2017; Zang et al, 2020; Morris et al, 2020; Mozes et al, 2021) most of them focused on accuracy as a metric of interest. Among the ones that studied toxicity and other ethical considerations (Wallace et al, 2019; Sheng et al, 2020) they did not put the focus on either conversational agents or they did not consider attacks being imperceptible.…”
Section: Related Work (mentioning; confidence: 99%)
“…The Adversarial NLI project asks humans to annotate mislabeled data and uses humans as adversaries to create a benchmark natural language inference (NLI) dataset for a more robust NLP model (Nie et al, 2020). The most related work compares the performance of human- and machine-generated word-level adversarial examples for NLP classification tasks (Mozes et al, 2021).…”
Section: Related Work (mentioning; confidence: 99%)
“…A saliency map shows what words the target model identifies as most important that are most likely to affect the prediction, and then marks those words with colors with different intensities. Unlike (Mozes et al, 2021), where the interface displays word saliencies calculated by replacing the word with an out-of-vocabulary token, we implement the built-in method in each automated attack to calculate the saliency score. For example, BAE and TextFooler simply delete the word and calculate the word saliencies, while PWWS replaces each word with an unknown token and calculates the weighted saliency.…”
Section: Generating Adversarial Examples (mentioning; confidence: 99%)
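The excerpt above contrasts two ways of scoring word saliency: replacing a word with an out-of-vocabulary token (as in PWWS-style scoring) versus deleting it outright (as in TextFooler and BAE). A rough Python sketch of both variants follows; it is not the exact implementation of any of those attacks, and predict_prob(tokens) is an assumed helper returning the victim model's probability for its originally predicted class.

def saliency_scores(words, predict_prob, mode="unk", unk_token="[UNK]"):
    """Score each token by how much masking it changes the model's confidence.

    words: list of tokens.
    mode="unk": replace the word with an out-of-vocabulary token (PWWS-style).
    mode="delete": drop the word entirely (TextFooler/BAE-style).
    """
    base = predict_prob(words)
    scores = []
    for i in range(len(words)):
        if mode == "unk":
            perturbed = words[:i] + [unk_token] + words[i + 1:]
        else:
            perturbed = words[:i] + words[i + 1:]
        # A large drop in confidence means the word is highly salient.
        scores.append(base - predict_prob(perturbed))
    return scores

The highest-scoring words are the ones whose masking most reduces the model's confidence, which is what an attack (or the annotation interface described in the cited work) would highlight first.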