2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00211

Fooling Network Interpretation in Image Classification

Abstract: Deep neural networks have been shown to be fooled rather easily using adversarial attack algorithms. Practical methods such as adversarial patches have been shown to be extremely effective in causing misclassification. However, these patches are highlighted using standard network interpretation algorithms, thus revealing the identity of the adversary. We show that it is possible to create adversarial patches which not only fool the prediction, but also change what we interpret regarding the cause of the predic…
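To make the idea concrete, below is a minimal sketch (not the authors' released code) of how such a patch could be optimized: the patch is trained both to flip the prediction and to keep the Grad-CAM map away from the patch region. The backbone, target class, patch placement, and equal loss weighting are all illustrative assumptions.

# Minimal sketch, assuming a ResNet-50 backbone, a fixed top-left 50x50 patch,
# and an equal weighting of the two loss terms -- all illustrative choices.
import torch
import torch.nn.functional as F
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).to(device).eval()

def gradcam(model, x, target_class, layer):
    """Simplified Grad-CAM on `layer` for `target_class`; returns (logits, cam)."""
    feats = {}
    def hook(_module, _inp, out):
        feats["a"] = out
    handle = layer.register_forward_hook(hook)
    logits = model(x)
    handle.remove()
    score = logits[:, target_class].sum()
    grads = torch.autograd.grad(score, feats["a"], create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)          # global-average-pooled gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1))         # (N, h, w)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return logits, cam

# Hypothetical image, target class, and patch placement.
x_clean = torch.rand(1, 3, 224, 224, device=device)
target = torch.tensor([0], device=device)
mask = torch.zeros(1, 1, 224, 224, device=device)
mask[..., :50, :50] = 1.0                                   # patch occupies the top-left corner
patch = torch.rand(1, 3, 50, 50, device=device, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.05)

for _ in range(200):
    padded = F.pad(patch.clamp(0, 1), (0, 174, 0, 174))     # place the 50x50 patch on a 224x224 canvas
    x_adv = x_clean * (1 - mask) + mask * padded
    logits, cam = gradcam(model, x_adv, target.item(), model.layer4)
    cam_full = F.interpolate(cam.unsqueeze(1), size=224, mode="bilinear", align_corners=False)
    loss_cls = F.cross_entropy(logits, target)              # push the prediction toward the target class
    loss_cam = (cam_full * mask).sum() / mask.sum()         # hide the patch from Grad-CAM
    loss = loss_cls + loss_cam
    opt.zero_grad()
    loss.backward()
    opt.step()

The second loss term is what distinguishes this from a standard adversarial patch, which would only minimize the classification loss and would therefore remain visible to the interpretation.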

Cited by 59 publications (51 citation statements) · References 26 publications (40 reference statements)
“…Due to the lack of ground truth, we do not know which pixel is in fact important to a model. Existing evaluation methods can be classified into three categories, namely, removing pixel features [17,18,39], setting relative ground truth [20,26,40] and user-oriented measurement [41,42].…”
Section: Evaluation Methods (mentioning, confidence: 99%)
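As an illustration of the first category mentioned in this excerpt (removing pixel features), here is a hedged sketch of a deletion-style evaluation; the zero baseline and ten-step schedule are arbitrary choices, not the protocol of any specific reference above.

# Hedged sketch of a deletion-style evaluation: zero out the pixels a saliency
# map ranks as most important and watch how the model's confidence in its
# original prediction decays.
import torch
import torch.nn.functional as F

@torch.no_grad()
def deletion_curve(model, image, saliency, steps=10):
    """image: (1, 3, H, W); saliency: (H, W) importance scores."""
    pred = model(image).argmax(dim=1)                       # class to track
    order = saliency.flatten().argsort(descending=True)     # most important pixels first
    x = image.clone()
    per_step = order.numel() // steps
    curve = []
    for s in range(steps):
        idx = order[s * per_step:(s + 1) * per_step]
        x.view(1, 3, -1)[..., idx] = 0.0                    # "remove" these pixels
        prob = F.softmax(model(x), dim=1)[0, pred].item()
        curve.append(prob)
    return curve                                            # a faster drop suggests a more faithful map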
“…Another approach is to set the ground truth from a different perspective. Akshayvarun et al. [20] introduce adversarial patches as the true cause of a prediction and show that the Grad-CAM interpretation method is unreliable and easily fooled by such adversarial examples. Mengjiao et al. [26] construct a carefully designed semi-natural dataset by pasting object pixels into scene images and train models on this dataset.…”
Section: Setting Relative Ground Truth (mentioning, confidence: 99%)
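A hedged sketch of how such a relative ground truth can be scored: with the adversarial patch location known, one can measure how much of the saliency mass falls inside the patch region. The energy ratio below is an illustrative metric, not the exact measure used in [20] or [26].

# Hedged sketch: with the patch location known, score how much saliency mass
# falls inside it.
import torch

def energy_in_region(saliency: torch.Tensor, region_mask: torch.Tensor) -> float:
    """saliency: (H, W) non-negative map; region_mask: (H, W) binary mask of the patch."""
    saliency = saliency.clamp(min=0)
    return (saliency * region_mask).sum().item() / (saliency.sum().item() + 1e-8)

A fooled interpretation yields a low ratio even though the patch caused the misclassification, while a faithful one keeps the ratio high.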
“…Pruthi et al. (2020) manipulate attention distributions in an end-to-end fashion; we focus on manipulating gradients. It is worth noting that we perturb models to manipulate interpretations; other work perturbs inputs (Ghorbani et al., 2019; Dombrowski et al., 2019; Subramanya et al., 2019). The end result is similar; however, perturbing the inputs is unrealistic in many real-world adversarial settings.…”
Section: Related Work (mentioning, confidence: 99%)
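A hedged sketch of the model-perturbation view described in this excerpt: fine-tune the model with a loss that keeps its predictions close to a frozen copy while steering its input-gradient saliency toward an arbitrary target map. The specific loss terms and weighting are assumptions, not the cited papers' exact objectives.

# Hedged sketch of perturbing the *model* so that its input-gradient saliency
# changes while its predictions stay close to a frozen copy.
import torch
import torch.nn.functional as F

def manipulation_loss(model, frozen_model, x, target_saliency, lam=1.0):
    """x: (N, 3, H, W); target_saliency: (N, H, W) map the attacker wants shown."""
    x = x.detach().clone().requires_grad_(True)
    logits = model(x)
    with torch.no_grad():
        ref_logits = frozen_model(x)
    # Fidelity: keep the manipulated model's outputs close to the original.
    fidelity = F.kl_div(F.log_softmax(logits, dim=1),
                        F.softmax(ref_logits, dim=1), reduction="batchmean")
    # Steering: push the input-gradient saliency toward the attacker's target map.
    score = logits.gather(1, logits.argmax(dim=1, keepdim=True)).sum()
    grad = torch.autograd.grad(score, x, create_graph=True)[0].abs().sum(dim=1)
    grad = grad / (grad.amax(dim=(1, 2), keepdim=True) + 1e-8)
    steer = F.mse_loss(grad, target_saliency)
    return fidelity + lam * steer

Minimizing this loss over the model's parameters in a standard fine-tuning loop leaves predictions nearly unchanged but rewrites what the saliency map shows, which is exactly the contrast with input perturbation drawn in the excerpt above.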
“…However, recent studies have proposed several attack methods showing that some XAI models can also be easily attacked. Examples include the input gradient [14], meaningful perturbation [15], fooling network interpretation [16], adversarial model manipulation [17], and deceiving local interpretable model-agnostic explanations (LIME) and Shapley additive explanations (SHAP) [18].…”
Section: Introduction (mentioning, confidence: 99%)