Medical Imaging 2020: Image Perception, Observer Performance, and Technology Assessment
DOI: 10.1117/12.2550066
Network output visualization to uncover limitations of deep learning detection of pneumothorax

Cited by 8 publications
(8 citation statements)
References 0 publications
“…In addition to being used for clinical interpretation, saliency method heat maps are also used for the evaluation of CXR interpretation models, for quality improvement (QI) and quality assurance (QA) in clinical practices, and for dataset annotation 51 . However, we found that saliency method localization performance, on balance, performed worse than expert localization across multiple analyses and across many important pathologies (our findings are consistent with recent work focused on localizing a single pathology, Pneumothorax, in CXRs 52 ). If used in clinical practice, heat maps that incorrectly highlight medical images may exacerbate well documented biases (chiefly, automation bias) and erode trust in model predictions (even when model output is correct), limiting clinical translation 22 .…”
Section: Discussion (supporting)
confidence: 88%
“…Our work has several potential implications for patient care. Heat maps generated using saliency methods are advocated as clinical decision support in the hope that the heat maps not only improve clinical decision-making, but also encourage clinicians to trust model predictions [32][33][34] . However, we found that AI localization performance, on balance, […] how we might improve saliency methods in the future.…”
Section: Discussion (mentioning)
confidence: 99%
“…In general, the particular localization tasks presented in this paper can be incredibly difficult due to the overlapping structures present in 2D chest radiographs, as well as subtle changes in texture that can be challenging to detect [17]. The challenges of the pneumothorax and pneumonia chest radiograph datasets serve to demonstrate the limitations on localization abilities of saliency maps.…”
Section: Supplementary Materials (mentioning)
confidence: 99%
“…The study found that models of similar accuracy produced different explanations, and GradCAM would even obscure most of the lesion of interest, rendering any explanation for melanoma classification clinically useless. Additionally, only two studies in the medical domain assessed saliency maps’ localization capabilities using some ground-truth measure, such as bounding boxes or semantic segmentation [17,45]. However, in Crosby et al. 2020 there was no quantification of the extent of overlap (utility) of GradCAM with the relevant image regions, but rather a binary measure of whether or not GradCAM’s region of highest activation intersected the pneumothorax region.…”
mentioning
confidence: 99%
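The distinction drawn in the excerpt above — a binary check of whether a saliency map's peak intersects the lesion region versus quantifying the extent of overlap — can be sketched with two small metrics. This is an illustrative sketch only; the function names, threshold, and toy arrays below are assumptions, not taken from the cited works.

```python
import numpy as np

def peak_hit(heatmap, gt_mask):
    """Binary 'pointing-game'-style metric: does the heatmap's single
    highest-activation pixel fall inside the ground-truth region?
    (Analogous to the intersection check described in the excerpt.)"""
    peak = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(gt_mask[peak])

def iou(heatmap, gt_mask, threshold=0.5):
    """Overlap-quantifying alternative: threshold the heatmap into a
    binary prediction and compute intersection-over-union with the
    ground-truth mask. The 0.5 threshold is an arbitrary choice."""
    pred = heatmap >= threshold
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return inter / union if union else 0.0

# Toy 4x4 example (hypothetical values, not real CXR data).
heatmap = np.array([[0.1, 0.6, 0.1, 0.0],
                    [0.2, 0.9, 0.8, 0.1],
                    [0.1, 0.8, 0.7, 0.1],
                    [0.0, 0.1, 0.1, 0.0]])
gt_mask = np.zeros((4, 4), dtype=bool)
gt_mask[1:3, 1:3] = True  # central 2x2 'lesion'

print(peak_hit(heatmap, gt_mask))       # True: peak (1,1) lies inside the mask
print(round(iou(heatmap, gt_mask), 3))  # 0.8: 4 overlapping cells / 5 in union
```

The binary metric reports a hit even when the thresholded map spills well outside the lesion, which is exactly why overlap quantification gives a more informative picture of clinical utility.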