Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.24

Gradient-based Analysis of NLP Models is Manipulable

Abstract: Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, their faithfulness. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a FACADE model that overwhelms the gradients w…
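The merging trick described in the abstract can be pictured with a short sketch. The snippet below is not the paper's construction (the authors merge the FACADE model into the target's layers and train it); it is only a simplified illustration of the underlying loophole, with all class and variable names invented for the example: a hypothetical facade branch adds the same per-example scalar to every logit, so the argmax prediction never changes, yet the gradient of any individual logit with respect to the input now flows through the facade as well.

```python
import torch
import torch.nn as nn

class MergedModel(nn.Module):
    """Simplified illustration (not the paper's exact layer-level merge):
    a facade branch adds one scalar per example to every logit. A constant
    shift across classes leaves the argmax, and hence the prediction,
    unchanged, but the gradient of any single logit w.r.t. the input now
    also flows through the facade, which can be trained to dominate it."""

    def __init__(self, target: nn.Module, facade: nn.Module):
        super().__init__()
        self.target = target  # model whose predictions we want to keep
        self.facade = facade  # maps the input to a (batch, 1) scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.target(x)  # (batch, num_classes)
        shift = self.facade(x)   # (batch, 1), broadcast across classes
        return logits + shift    # same predictions, contaminated gradients
```

Note that this particular shift would cancel in a loss-based saliency, since cross-entropy is invariant to a per-example constant added to all logits; handling such saliencies too is part of why the paper's layer-level FACADE construction is more involved than this sketch.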

Cited by 25 publications (16 citation statements) | References 29 publications
“…Sundararajan et al. (2017) show that in practice, gradients are saturated: they may all be close to zero for a well-fitted function, and thus not reflect importance. Adversarial methods can also distort gradient-based saliences while keeping a model's prediction the same (Ghorbani et al., 2019; Wang et al., 2020). We compare greedy rationalization to gradient saliency methods in Section 8.…”
Section: Related Work
confidence: 99%
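For context, the gradient saliency that such attacks distort can be written in a few lines. This is a generic gradient-times-input sketch, not tied to any of the cited papers' implementations; it assumes a hypothetical `model` that maps one example's (seq_len, dim) embedding tensor to the scalar score of the predicted class.

```python
import torch

def gradient_saliency(model, embeddings: torch.Tensor) -> torch.Tensor:
    """Generic gradient-times-input token saliency (sketch).

    Assumes `model(embeddings)` returns the scalar score of the predicted
    class for a single example of shape (seq_len, dim). Returns one
    importance value per token.
    """
    embeddings = embeddings.clone().detach().requires_grad_(True)
    score = model(embeddings)  # scalar score of the predicted class
    score.backward()           # populates embeddings.grad
    return (embeddings.grad * embeddings).norm(dim=-1)  # (seq_len,)
```

On a well-fitted model these gradients can saturate toward zero (the Sundararajan et al. point above), and on an adversarially merged model they can be steered toward tokens of the attacker's choosing, which is why the quoted works treat raw gradient saliency with caution.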
“…L2E can be applied to any Natural Language Processing task to which an underlying feature-based explanation algorithm can be applied, such as Natural Language Inference and Question Answering (Wang et al., 2020). In this paper, we focus on explaining text classification models.…”
Section: Learning to Explain (L2E)
confidence: 99%
“…Some other popular explainability methods include neuron-based analysis and transfer learning (Rethmeier et al., 2020) and promising gradient-based analysis, which directly reflects the knowledge learned by the model (Wallace et al., 2019). However, it has been recently shown that it is relatively easy to manipulate and corrupt gradient-based explainability methods (Wang et al., 2020).…”
Section: Introduction
confidence: 99%