Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1423
Certified Robustness to Adversarial Word Substitutions

Abstract: State-of-the-art NLP models can often be fooled by adversaries that apply seemingly innocuous label-preserving transformations (e.g., paraphrasing) to input text. The number of possible transformations scales exponentially with text length, so data augmentation cannot cover all transformations of an input. This paper considers one exponentially large family of label-preserving transformations, in which every word in the input can be replaced with a similar word. We train the first models that are provably robu…
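The exponential growth the abstract refers to is easy to make concrete. The sketch below is a toy illustration with invented numbers, not figures from the paper; the paper's actual substitution sets come from word-similarity neighborhoods.

```python
# Toy illustration of the combinatorial growth described in the abstract.
# Substitution counts below are invented for illustration only.

def num_variants(subs_per_word):
    """Count all sentences reachable by independent word substitutions."""
    total = 1
    for k in subs_per_word:
        total *= k + 1  # each position: keep the word, or pick one of k substitutes
    return total

# A 20-word sentence with 5 candidate substitutes per word already yields
# 6^20 = 3656158440062976 variants, far more than augmentation can cover:
print(num_variants([5] * 20))
```

Because the variants multiply across positions, even a modest per-word substitution budget defeats enumeration-based defenses, which motivates the certified bounds the paper computes instead.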

Cited by 194 publications (254 citation statements)
References 18 publications
“…There were two things left unspecified in the definitions above: the distance function d to use in discrete input spaces, and the method for sampling from a local decision boundary. While there has been some work trying to formally characterize distances for adversarial robustness in NLP (Michel et al., 2019; Jia et al., 2019), we find it more useful in our setting to simply rely on expert judgments to generate a similar but meaningfully different x′ given x, addressing both the distance function and the sampling method.…”
Section: Contrast Sets In Practice (mentioning)
confidence: 99%
“…Note that leaderboards do not necessarily incentivize the creation of brittle and biased models; rather, because leaderboard utility is so parochial, these unintended consequences are relatively common. Some recent work has addressed the problem of brittleness by offering certificates of performance against adversarial examples (Raghunathan et al., 2018a,b; Jia et al., 2019). To tackle gender bias, the SuperGLUE leaderboard considers accuracy on the WinoBias task (Wang et al., 2019; Zhao et al., 2018).…”
Section: Robustness (mentioning)
confidence: 99%
“…(2) Interval Bound Propagation (IBP) (Dvijotham et al., 2018) is proposed as a new technique to theoretically consider the worst-case perturbation. Recent works (Jia et al., 2019; Huang et al., 2019) have applied IBP in the NLP domain to certify the robustness of models. (3) Language models including GPT-2 (Radford et al., 2019) may also function as anomaly detectors to probe inconsistent and unnatural adversarial sentences.…”
Section: Discussion and Future Work (mentioning)
confidence: 99%
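The IBP technique mentioned in the excerpt above pushes an axis-aligned box of inputs through the network layer by layer, so the output bounds cover every input in the box, including the worst-case word substitution. Below is a minimal self-contained sketch of that propagation for one affine layer plus ReLU; the weights and input bounds are toy values chosen for illustration, not parameters from any cited model.

```python
# Minimal sketch of Interval Bound Propagation (IBP) through an affine
# layer y = W x + b followed by ReLU, using plain Python lists.
# Weights and bounds are toy values for illustration.

def affine_bounds(W, b, lo, hi):
    """Propagate the box [lo, hi] through y = W x + b (center/radius form)."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        center = sum(w * (l + h) / 2 for w, l, h in zip(row, lo, hi)) + bias
        radius = sum(abs(w) * (h - l) / 2 for w, l, h in zip(row, lo, hi))
        out_lo.append(center - radius)
        out_hi.append(center + radius)
    return out_lo, out_hi

def relu_bounds(lo, hi):
    """ReLU is monotone, so it maps interval endpoints to endpoints."""
    return [max(0.0, l) for l in lo], [max(0.0, h) for h in hi]

W = [[1.0, -2.0], [0.5, 1.0]]
b = [0.0, -1.0]
lo, hi = [0.0, 0.0], [1.0, 1.0]      # input box, e.g. a hull of word embeddings
lo, hi = affine_bounds(W, b, lo, hi)  # -> [-2.0, -1.0], [1.0, 0.5]
lo, hi = relu_bounds(lo, hi)          # -> [0.0, 0.0], [1.0, 0.5]
print(lo, hi)
```

In the certification setting, the input box would enclose all embeddings reachable by word substitutions, and training minimizes the worst-case loss implied by the final-layer bounds.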