2022
DOI: 10.48550/arxiv.2201.13291
Preprint

Metrics for saliency map evaluation of deep learning explanation methods

Abstract: Due to the black-box nature of deep learning models, there is a recent development of solutions for visual explanations of CNNs. Given the high cost of user studies, metrics are necessary to compare and evaluate these different methods. In this paper, we critically analyze the Deletion Area Under Curve (DAUC) and Insertion Area Under Curve (IAUC) metrics proposed by Petsiuk et al. (2018). These metrics were designed to evaluate the faithfulness of saliency maps generated by generic methods such as Grad-CAM or …

Cited by 4 publications (8 citation statements)
References 18 publications
“…The same issue of unnatural inputs was raised by Mase et al (2019). Gomez et al (2022) also point out that insertion and deletion tests only compare the rankings of the inputs. We have chosen to study insertion and deletion tests because they avoid the prohibitive cost of retraining.…”
Section: Related Work
confidence: 87%
“…The insertion and deletion tests we study have been criticized by Gomez et al (2022) who note that the synthesized images generated in these tests are unnatural and do not resemble the images on which the algorithms were trained. The same issue of unnatural inputs was raised by Mase et al (2019).…”
Section: Related Work
confidence: 99%
“…The Deletion metric [22,39,46] (↓ lower is better) measures the Area under the Curve (AUC) of the target-class probability as we zero out the top-N highest-attribution pixels at each step in the input image. That is, a faithful AM is expected to have a lower AUC in Deletion.…”
Section: Evaluation Metrics
confidence: 99%
“…That is, a faithful AM is expected to have a lower AUC in Deletion. For the Insertion metric [22,39,46] (↑ higher is better) we start from a zero image and add the top-N highest-attribution pixels at each step until recovering the original image, and calculate the AUC of the probability curve. For both Deletion and Insertion, we use the implementation by [38] and N = 448 at each step.…”
Section: Evaluation Metrics
confidence: 99%
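
The two statements above describe the standard Deletion/Insertion protocol of Petsiuk et al. (2018) that the preprint analyzes. As a rough, non-authoritative sketch of how such AUC scores are typically computed: the model, image tensor, saliency map, class index, and the step size n_per_step below are hypothetical placeholders, not the implementation used by the cited works.

```python
import torch
import torch.nn.functional as F

def deletion_insertion_auc(model, image, saliency, target, n_per_step=448, mode="deletion"):
    """Sketch of the Deletion / Insertion AUC metrics (Petsiuk et al., 2018).

    image:    (1, C, H, W) input tensor
    saliency: (H, W) attribution map for the target class
    target:   index of the class whose probability is tracked
    """
    model.eval()
    _, _, h, w = image.shape
    # Rank pixel positions from most to least salient.
    order = saliency.flatten().argsort(descending=True)

    # Deletion starts from the original image and zeroes pixels out;
    # Insertion starts from an all-zero image and copies original pixels back in.
    current = image.clone() if mode == "deletion" else torch.zeros_like(image)

    probs = []
    with torch.no_grad():
        for start in range(0, order.numel() + n_per_step, n_per_step):
            # Record the target-class probability for the current image.
            p = F.softmax(model(current), dim=1)[0, target].item()
            probs.append(p)
            idx = order[start:start + n_per_step]
            if idx.numel() == 0:
                break
            ys = torch.div(idx, w, rounding_mode="floor")
            xs = idx % w
            if mode == "deletion":
                current[0, :, ys, xs] = 0.0
            else:
                current[0, :, ys, xs] = image[0, :, ys, xs]

    # Area under the probability curve, normalised so the x-axis spans [0, 1].
    probs = torch.tensor(probs)
    return torch.trapz(probs, dx=1.0 / max(len(probs) - 1, 1)).item()
```

Under this convention a faithful attribution map should yield a low Deletion AUC and a high Insertion AUC, matching the ↓/↑ notation used in the quoted passage.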
“…However, the financial cost and the difficulty of establishing a correct protocol make this approach difficult. Because of these issues, another trend focuses on designing objective metrics to evaluate generic explanation methods [25,18,5,12]. In this paper, we follow this trend and study the behavior of recently proposed objective faithfulness metrics applied to the problem of embryo stage identification.…”
Section: Introduction
confidence: 99%