Purpose: To evaluate the trustworthiness of saliency maps for abnormality localization in medical imaging.
Materials and Methods: Using two large publicly available radiology datasets (SIIM-ACR Pneumothorax Segmentation and RSNA Pneumonia Detection), we quantified the performance of eight commonly used saliency map techniques with regard to their 1) localization utility (segmentation and detection), 2) sensitivity to model weight randomization, 3) repeatability, and 4) reproducibility. We compared their performance against baseline methods and localization network architectures, using area under the precision-recall curve (AUPRC) and structural similarity index (SSIM) as metrics.
Results: All eight saliency map techniques failed at least one of the criteria and were inferior in performance to localization networks. For pneumothorax segmentation, the AUPRC ranged from 0.024 to 0.224, while a U-Net achieved a significantly superior AUPRC of 0.404 (p<0.005). For pneumonia detection, the AUPRC ranged from 0.160 to 0.519, while a RetinaNet achieved a significantly superior AUPRC of 0.596 (p<0.005). Five and two saliency methods (out of eight) failed the model randomization test on the segmentation and detection datasets, respectively, suggesting that these methods are not sensitive to changes in model parameters. The repeatability and reproducibility of the majority of the saliency methods were worse than those of localization networks for both the segmentation and detection datasets.
Conclusion: We suggest that the use of saliency maps in the high-risk domain of medical imaging warrants additional scrutiny and recommend that detection or segmentation models be used if localization is the desired output of the network.
Saliency maps have become a widely used method to make deep learning models more interpretable by providing post-hoc explanations of classifiers through identification of the most pertinent areas of the input medical image. They are increasingly being used in medical imaging to provide clinically plausible explanations for the decisions the neural network makes. However, the utility and robustness of these visualization maps have not yet been rigorously examined in the context of medical imaging. We posit that trustworthiness in this context requires 1) localization utility, 2) sensitivity to model weight randomization, 3) repeatability, and 4) reproducibility. Using the localization information available in two large public radiology datasets, we quantify the performance of eight commonly used saliency map approaches on the above criteria using area under the precision-recall curve (AUPRC) and structural similarity index (SSIM), comparing their performance to various baseline measures. Using our framework to quantify the trustworthiness of saliency maps, we show that all eight saliency map techniques fail at least one of the criteria and are, in most cases, less trustworthy than the baselines. We suggest that their usage in the high-risk domain of medical imaging warrants additional scrutiny and recommend that detection or segmentation models be used if localization is the desired output of the network.
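As an illustration of how localization utility can be scored with AUPRC, the following is a minimal sketch that treats each pixel's saliency value as a score and the ground-truth mask as a binary label, then estimates AUPRC by average precision. The function name `average_precision`, the flattened toy arrays, and the choice of average precision as the estimator are assumptions for illustration; the abstract does not state how the authors computed the metric.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Estimate AUPRC as the mean precision at the rank of each positive pixel.

    Ties in `scores` are broken by pixel order, which is a simplification.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank pixels from most to least salient.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Hypothetical flattened saliency map and ground-truth mask (not study data).
saliency = [0.9, 0.8, 0.7, 0.6]
mask = [1, 0, 1, 0]
ap = average_precision(saliency, mask)
print(f"AUPRC (average precision) = {ap:.3f}")
```

A saliency map that ranks every abnormal pixel above every normal pixel scores 1.0 under this estimator, while a map uninformative about location tends toward the fraction of positive pixels in the image.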
Purpose
To develop an automated measure of COVID-19 pulmonary disease severity on chest radiographs (CXRs), for longitudinal disease evaluation and clinical risk stratification.
Materials and Methods
A convolutional Siamese neural network-based algorithm was trained to output a measure of pulmonary disease severity, the pulmonary x-ray severity (PXS) score, on anterior-posterior CXRs, using weakly-supervised pretraining on ~160,000 images from CheXpert and transfer learning on 314 CXRs from patients with COVID-19. The algorithm was evaluated on internal and external test sets from different hospitals, containing 154 and 113 CXRs, respectively. The PXS score was correlated with a radiographic severity score independently assigned by two thoracic radiologists and one in-training radiologist. For 92 internal test set patients with follow-up CXRs, the change in PXS score was compared to radiologist assessments of change. The association between PXS score and subsequent intubation or death was assessed.
Results
The PXS score correlated with the radiographic pulmonary disease severity score assigned to CXRs in the COVID-19 internal and external test sets (ρ=0.84 and ρ=0.78, respectively). The direction of change in PXS score in follow-up CXRs agreed with radiologist assessment (ρ=0.74). In patients not intubated on the admission CXR, the PXS score predicted subsequent intubation or death within three days of hospital admission (area under the receiver operating characteristic curve=0.80; 95% CI 0.75-0.85).
Conclusion
A Siamese neural network-based severity score automatically measures COVID-19 pulmonary disease severity on chest radiographs and can be scaled and rapidly deployed for clinical triage and workflow optimization.
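The ρ values reported above are rank correlations. As a minimal sketch of how such a coefficient can be computed between an automated score and a radiologist-assigned score, the following implements Spearman's ρ as the Pearson correlation of average ranks. The helper names and the example numbers are hypothetical and are not taken from the study.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (vx * vy) ** 0.5


# Hypothetical automated scores and radiologist severity scores.
model_scores = [1.2, 3.4, 2.2, 5.0]
reader_scores = [0, 2, 1, 3]
rho = spearman_rho(model_scores, reader_scores)
print(f"Spearman rho = {rho:.2f}")
```

Because only the ordering matters, ρ is unaffected by any monotonic rescaling of the automated score, which suits a comparison between a continuous model output and an ordinal reader score.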
Objective: We developed deep learning algorithms to automatically assess BI-RADS breast density.
Methods: Using a large multi-institution patient cohort of 108,230 digital screening mammograms from the Digital Mammographic Imaging Screening Trial, we investigated the effect of data, model, and training parameters on overall model performance and provided crowdsourcing evaluation from the attendees of the ACR 2019 Annual Meeting.
We describe an alternative intubation technique using a rigid nasendoscope and a video camera monitor system in two infants with Pierre-Robin sequence presenting for palatoplasty. After induction with an inhalational anaesthetic technique, the tracheas of the infants could not be intubated with direct laryngoscopy using a Wisconsin blade. In the absence of a flexible paediatric fibrescope, a rigid endoscope (2.7 mm, 70 degrees lateral illumination) was passed orally to provide a view of the glottis on the monitor screen. A tracheal tube, bent into a J-shape using a stylet, was inserted orally and manipulated into the trachea, under video guidance. This technique proved to be simple, permitting a favourable view of the glottis. It should be considered for passing a tracheal tube through the vocal cords in infants who present with a difficult airway.
Model brittleness is a key concern when deploying deep learning models in real-world medical settings. A model that has high performance at one institution may suffer a significant decline in performance when tested at other institutions. While pooling datasets from multiple institutions and re-training may provide a straightforward solution, it is often infeasible and may compromise patient privacy. An alternative approach is to fine-tune the model on subsequent institutions after training on the original institution. Notably, this approach degrades model performance at the original institution, a phenomenon known as catastrophic forgetting. In this paper, we develop an approach to address catastrophic forgetting based on elastic weight consolidation combined with modulation of batch normalization statistics under two scenarios: first, for expanding the domain from one imaging system's data to another imaging system's, and second, for expanding the domain from a large multi-institutional dataset to another single institution dataset. We show that our approach outperforms several other state-of-the-art approaches and provide theoretical justification for the efficacy of batch normalization modulation. The results of this study are generally applicable to the deployment of any clinical deep learning model which requires domain expansion.
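For readers unfamiliar with elastic weight consolidation (EWC), the following is a minimal sketch of its standard quadratic penalty, which discourages parameters that were important at the original institution from drifting during fine-tuning. It shows only the generic EWC term, not the batch normalization modulation and not the authors' implementation; the function names, the flat parameter lists, and the numeric values are assumptions for illustration.

```python
from typing import Sequence


def ewc_penalty(
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    lam: float,
) -> float:
    """Standard EWC term: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    anchor_params are the weights learned on the original domain, and
    fisher_diag is a diagonal Fisher information estimate of their importance.
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all parameter sequences must have the same length")
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def total_loss(
    task_loss: float,
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    lam: float,
) -> float:
    """Loss on the new domain plus the EWC penalty toward the old weights."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)


# Hypothetical two-parameter example: only the second weight has moved,
# and it carries a high importance estimate, so it dominates the penalty.
theta = [1.0, 2.0]
theta_star = [1.0, 1.0]
fisher = [0.5, 2.0]
penalty = ewc_penalty(theta, theta_star, fisher, lam=2.0)
loss = total_loss(0.3, theta, theta_star, fisher, lam=2.0)
print(penalty, loss)
```

The weighting `lam` trades off performance on the new domain against retention on the original one: larger values keep the model closer to the original weights.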