Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP 2021
DOI: 10.18653/v1/2021.blackboxnlp-1.33

Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing

Abstract: Interpretability methods like INTEGRATED GRADIENT and LIME are popular choices for explaining natural language model predictions with relative word importance scores. These interpretations need to be robust for trustworthy NLP applications in high-stakes areas like medicine or finance. Our paper demonstrates how interpretations can be manipulated by making simple word perturbations on an input text. Via a small portion of word-level swaps, these adversarial perturbations aim to make the resulting text semantica…
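
To make the abstract's setup concrete, the sketch below shows one way such a word-swap perturbation and its effect on an explanation could be measured: a small fraction of tokens is replaced with near-synonym candidates, and the Spearman rank correlation between importance scores before and after the swap serves as a rough fragility signal. This is a minimal sketch under stated assumptions: the `model_importance` callable (e.g. per-token scores from Integrated Gradients or LIME), the `candidates` synonym map, and the 10% swap budget are all illustrative, not the paper's actual attack procedure.

```python
# Hedged sketch of the perturbation setup described in the abstract: swap a
# small fraction of words and measure how much the importance ranking shifts.
# `model_importance` and `candidates` are assumptions made for illustration.
import random
from scipy.stats import spearmanr

def perturb_tokens(tokens, candidates, swap_fraction=0.1, seed=0):
    """Return a copy of `tokens` with a small fraction swapped for candidates."""
    rng = random.Random(seed)
    perturbed = list(tokens)
    n_swaps = max(1, int(swap_fraction * len(tokens)))
    for idx in rng.sample(range(len(tokens)), n_swaps):
        options = candidates.get(tokens[idx], [])
        if options:
            perturbed[idx] = rng.choice(options)
    return perturbed

def interpretation_shift(model_importance, tokens, perturbed):
    """Spearman rank correlation between the two importance vectors;
    values near zero (or negative) indicate a badly disrupted explanation."""
    rho, _ = spearmanr(model_importance(tokens), model_importance(perturbed))
    return rho
```

A real attack would search over many candidate swaps and keep only those that preserve the model's prediction while minimizing this correlation; the sketch only illustrates the measurement step.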

Citations: cited by 9 publications (7 citation statements).
References: 27 publications referenced (16 reference statements).
“…• L_p Distance: An intuitive and straightforward metric to compare two explanation maps is to compute the normed distance between them. Some of the widely used metrics are median L_1 distance, used in [15], and the L_2 distance [17,36,107]. Mean Squared Error (MSE), a metric derived from L_2 distance, is also very popular.…”
Section: Evaluating the Robustness of Explanation Methods (mentioning)
confidence: 99%
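
As a concrete reading of the metrics named in the excerpt above, the sketch below compares two per-token attribution vectors with the L_1, median-L_1, L_2, and MSE variants. The function name and the toy attribution values are illustrative assumptions, not code from the cited works.

```python
# Hedged sketch: distance metrics between two explanation maps, assuming both
# are per-token importance vectors of equal length.
import numpy as np

def explanation_distance(attr_a, attr_b, metric="l2"):
    """Compare two attribution maps with a normed distance."""
    diff = np.asarray(attr_a, dtype=float) - np.asarray(attr_b, dtype=float)
    if metric == "l1":
        return np.abs(diff).sum()
    if metric == "median_l1":        # median L1 distance, as used in [15]
        return np.median(np.abs(diff))
    if metric == "l2":               # L2 distance [17, 36, 107]
        return np.sqrt((diff ** 2).sum())
    if metric == "mse":              # mean squared error, derived from L2
        return (diff ** 2).mean()
    raise ValueError(f"unknown metric: {metric}")

# Toy example: attributions for the same sentence before and after an attack
before = [0.42, 0.10, 0.31, 0.05]
after = [0.05, 0.12, 0.45, 0.30]
for m in ("l1", "median_l1", "l2", "mse"):
    print(m, explanation_distance(before, after, m))
```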
“…From another perspective, while there have been many surveys of literature on adversarial attacks and robustness [7,8,11,25,29,35,46,51,57,61,65,69,75,77,101,104,112,113,116,118,119,121,122,129,135], which focus on attacking the predictive outcome of these models, there has been no effort so far to study and consolidate existing efforts on attacks on the explainability of DNN models. Many recent efforts have demonstrated the vulnerability of explanations (or attributions) to human-imperceptible input perturbations across image, text and tabular data [36,45,55,62,107,108,133]. Similarly, there have also been many efforts in recent years in securing the stability of such explanations [13,26,30,36,37,50,54,73,97,99,106,125].…”
Section: Introduction (mentioning)
confidence: 99%
“…(Ghorbani, Abid, and Zou 2019) showed that explanations can easily be misled by introducing imperceptible noise in the input image. Several other works have highlighted similar problems in vision, natural language and reinforcement learning, such as (Adebayo et al. 2018; Dombrowski et al. 2019; Slack et al. 2020; Kindermans et al. 2019; Sinha et al. 2021; Huai et al. 2020). Similarly, concept explanation methods are also fragile to small perturbations to input samples (Brown and Kvinge 2021).…”
Section: Related Work (mentioning)
confidence: 97%
“…Related work on concept-level explanations. Recent research has focused on designing concept-based deep learning methods to interpret how deep learning models can use high-level human-understandable concepts in arriving at decisions [Ghorbani et al., 2019; Wu et al., 2020; Koh et al., 2020; Yeh et al., 2019; Mincu et al., 2021; Huang et al., 2022; Leemann et al., 2022; Sinha et al., 2021; Sinha et al., 2023]. Such concept-based deep learning models aim to incorporate high-level concepts into the learning procedure.…”
Section: Related Work (mentioning)
confidence: 99%