Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1419
Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation

Abstract: Neural networks are part of many contemporary NLP systems, yet their empirical successes come at the price of vulnerability to adversarial attacks. Previous work has used adversarial training and data augmentation to partially mitigate such brittleness, but these are unlikely to find worst-case adversaries due to the complexity of the search space arising from discrete text perturbations. In this work, we approach the problem from the opposite direction: to formally verify a system's robustness against a predefined…

Cited by 100 publications (123 citation statements) · References 25 publications
“…Additionally, the edit distance constraint is sometimes used when improving the robustness of models. For example, Huang et al. (2019) uses Interval Bound Propagation to ensure model robustness to perturbations within some edit distance of the input.…”
Section: Overlap (mentioning)
confidence: 99%
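To make the perturbation set in the statement above concrete, the following is a minimal Python sketch (not taken from Huang et al., 2019) that enumerates a substitution-only edit-distance ball around an input string; the `substitution_ball` helper, the charset, and the budget `k` are illustrative assumptions.

```python
from itertools import combinations, product

def substitution_ball(text, charset, k=1):
    """All strings obtainable from `text` by replacing at most k characters
    with symbols from `charset` (a substitution-only edit-distance ball).
    Illustrative helper; insertions and deletions are not modelled."""
    results = {text}
    for budget in range(1, k + 1):
        for positions in combinations(range(len(text)), budget):
            for replacements in product(charset, repeat=budget):
                chars = list(text)
                for pos, new_char in zip(positions, replacements):
                    chars[pos] = new_char
                results.add("".join(chars))
    return results

# 6 positions x 25 genuinely new lowercase letters, plus the original string
print(len(substitution_ball("robust", "abcdefghijklmnopqrstuvwxyz", k=1)))  # 151
```

Even at k = 1 the set grows with the input length times the alphabet size, which is why exhaustive search over discrete perturbations quickly becomes impractical and bound propagation is attractive.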
“…There has also been a trend in usage of certified robustness approaches (Ko et al., 2019; Huang et al., 2019; Shi et al., 2020) which provide guarantees on the minimum performance of models. The main technique so far is to propagate interval bounds around input word embeddings and has been applied for robustness to synonyms change.…”
Section: Results (mentioning)
confidence: 99%
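The quoted description, propagating interval bounds around input word embeddings, amounts to standard interval arithmetic pushed through the network layer by layer. Below is a minimal NumPy sketch assuming a toy two-layer ReLU classifier and an axis-aligned box enclosing the embeddings of a word and its admissible substitutions; the dimensions, random weights, and box construction are illustrative assumptions, not the authors' model.

```python
import numpy as np

def interval_affine(lower, upper, W, b):
    """Propagate the box [lower, upper] through x -> W @ x + b exactly."""
    center = (upper + lower) / 2.0
    radius = (upper - lower) / 2.0
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius
    return new_center - new_radius, new_center + new_radius

def interval_relu(lower, upper):
    """ReLU is elementwise monotone, so the bounds map through directly."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

# Hypothetical setup: a word with two admissible substitutions, 4-dim embeddings.
rng = np.random.default_rng(0)
candidate_embeddings = rng.normal(size=(3, 4))   # original word + 2 substitutions
lower = candidate_embeddings.min(axis=0)         # box covering every candidate
upper = candidate_embeddings.max(axis=0)

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # hidden layer
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)    # 2-class output layer

l, u = interval_affine(lower, upper, W1, b1)
l, u = interval_relu(l, u)
logit_lower, logit_upper = interval_affine(l, u, W2, b2)
print(logit_lower, logit_upper)  # bounds on the logits over all substitutions
```

Every logit any admissible substitution could produce is guaranteed to lie inside `[logit_lower, logit_upper]`, although the box may be loose because interval arithmetic ignores correlations between coordinates.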
“…(2) Interval Bound Propagation (IBP) (Dvijotham et al., 2018) is proposed as a new technique to theoretically consider the worst-case perturbation. Recent works (Jia et al., 2019; Huang et al., 2019) have applied IBP in the NLP domain to certify the robustness of models. (3) Language models including GPT2 (Radford et al., 2019) may also function as an anomaly detector to probe the inconsistent and unnatural adversarial sentences.…”
Section: Discussion and Future Work (mentioning)
confidence: 99%
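Continuing the NumPy sketch above, the usual way such interval bounds are turned into a certificate is to check that the worst-case logit of the gold class still exceeds the best-case logit of every other class. The helper below is illustrative; it is sound but slightly looser than bounding logit differences directly through the final layer, which IBP implementations typically do.

```python
def is_certified(logit_lower, logit_upper, true_class):
    """Certified robust if the worst-case logit of the true class beats the
    best-case logit of every competing class under the propagated bounds."""
    return all(
        logit_lower[true_class] > logit_upper[c]
        for c in range(len(logit_lower))
        if c != true_class
    )

# e.g. is_certified(logit_lower, logit_upper, true_class=0)
```

If this check passes for the box covering all admissible symbol substitutions, no perturbation inside that box can flip the prediction, which is the kind of minimum-performance guarantee the quoted statement refers to.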