Abstract: Artificial intelligence (AI) researchers and radiologists have recently reported AI systems that accurately detect COVID-19 in chest radiographs. However, the robustness of these systems remains unclear. Using state-of-the-art techniques in explainable AI, we demonstrate that recent deep learning systems to detect COVID-19 from chest radiographs rely on confounding factors rather than medical pathology, creating an alarming situation in which the systems appear accurate, but fail when tested in new hospitals.
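Occlusion sensitivity is one of the explainable-AI techniques commonly used for this kind of audit: mask a region, re-score the image, and treat the drop in model score as that region's importance. The sketch below is a minimal illustration under assumed names; `occlusion_map` and the toy `shortcut_model`, which mimics a network keying on an image corner rather than pathology, are invented for this example and are not from the study:

```python
import numpy as np

def occlusion_map(model, image, patch=4):
    """Slide a gray patch over the image; the drop in model score marks
    regions the model relies on (higher = more important)."""
    base = model(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = image.mean()
            heat[i // patch, j // patch] = base - model(occluded)
    return heat

# Toy "shortcut" classifier: scores only the top-left corner,
# mimicking a model that keys on a lateral marker, not pathology.
def shortcut_model(img):
    return img[:4, :4].mean()

rng = np.random.default_rng(0)
img = rng.random((16, 16))
img[:4, :4] = 1.0  # plant a bright confound in the corner
heat = occlusion_map(shortcut_model, img)
print(np.unravel_index(heat.argmax(), heat.shape))
```

A model relying on pathology would light up the lung fields rather than a fixed corner; a map concentrated on shoulder regions or image borders is exactly the kind of discrepancy the study reports.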
“…The problem with lacking insight into model decisions appears when a model learns to predict accurately based on irrelevant features and thus generalizes poorly to other datasets. A recent study, for instance, reported that an accurate AI model trained to identify COVID-19 in chest radiographs actually failed to make use of the relevant information in the images [157]. Due to a consistent patient positioning during imaging, the model instead recognized COVID-19-positive patients based on their shoulder regions.…”
Section: Explaining Decisions Made By AI Models (citation type: mentioning; confidence: 99%)
“…Moreover, by knowing the decision pattern, the clinician would be able to assess faithfulness of the prediction. For this reason, explainability is suggested to be an ethical requirement for future clinical decision systems [157]. Here, we introduce different explainable AI methods and give a brief overview of applications in medicine and analysis of molecular pathways.…”
Cardiovascular diseases (CVD) annually take almost 18 million lives worldwide. Most lethal events occur months or years after the initial presentation. Indeed, many patients experience repeated complications or require multiple interventions (recurrent events). Apart from affecting the individual, this leads to high medical costs for society. Personalized treatment strategies aiming at prediction and prevention of recurrent events rely on early diagnosis and precise prognosis. Complementing the traditional environmental and clinical risk factors, multi-omics data provide a holistic view of the patient and disease progression, enabling studies to probe novel angles in risk stratification. Specifically, predictive molecular markers allow insights into regulatory networks, pathways, and mechanisms underlying disease. Moreover, artificial intelligence (AI) represents a powerful, yet adaptive, framework able to recognize complex patterns in large-scale clinical and molecular data with the potential to improve risk prediction. Here, we review the most recent advances in risk prediction of recurrent cardiovascular events, and discuss the value of molecular data and biomarkers for understanding patient risk in a systems biology context. Finally, we introduce explainable AI which may improve clinical decision systems by making predictions transparent to the medical practitioner.
“…[16] Learning these shortcuts instead of the underlying nature of the problem is a topic of concern in the field.[17] It is therefore understandable that many machine learning methods spark criticism regarding the difficulty of understanding the rationale behind their predictions. It has been questioned whether a pharmaceutical company would promote a given molecule into a portfolio based only on an opaque prediction made by a neural network, without any clear explanation to support it.…”
Deep learning has been successfully applied to structure-based protein–ligand affinity prediction, yet the black-box nature of these models raises some questions. In a previous study, we presented K_DEEP, a convolutional neural network that predicted the binding affinity of a given protein–ligand complex while reaching state-of-the-art performance. However, it was unclear what this model was learning. In this work, we present a new application to visualize the contribution of each input atom to the prediction made by the convolutional neural network, aiding in the interpretability of such predictions. The results suggest that K_DEEP is able to learn meaningful chemistry signals from the data, but it has also exposed the inaccuracies of the current model, serving as a guideline for further optimization of our prediction tools.
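The per-atom contribution idea can be approximated in a few lines with gradient × input attribution. This is a generic sketch via central finite differences, not the authors' implementation; the linear `score` function is an invented stand-in for the K_DEEP network:

```python
import numpy as np

def atom_contributions(score_fn, features, eps=1e-4):
    """Per-feature contribution via central finite differences:
    the gradient of the predicted affinity w.r.t. each input feature,
    scaled by the feature value (gradient * input attribution)."""
    grads = np.zeros_like(features, dtype=float)
    for idx in np.ndindex(features.shape):
        plus, minus = features.copy(), features.copy()
        plus[idx] += eps
        minus[idx] -= eps
        grads[idx] = (score_fn(plus) - score_fn(minus)) / (2 * eps)
    return grads * features

# Toy affinity model: only the first two "atoms" influence the score.
weights = np.array([2.0, -1.0, 0.0, 0.0])

def score(x):
    return float(weights @ x)

atoms = np.array([1.0, 1.0, 1.0, 1.0])
contrib = atom_contributions(score, atoms)
print(contrib)  # ≈ [ 2. -1.  0.  0.]
```

For a real network one would backpropagate instead of using finite differences, but the output has the same shape as the input, so each atom (or voxel) receives a signed contribution that can be rendered on the 3D complex.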
“…However, the field has inspired controversy. DeGrave et al [8] demonstrated that combining data from multiple sources, in particular where data from different classes have different acquisition and pre-processing parameters, led to a significant bias that artificially improved the measured performance in many studies. Garcia Santa Cruz et al [9] presented a review of public CXR datasets, concluding that the most popular datasets used in the literature were at a high risk of introducing bias into reported results.…”
Objectives: To conduct a systematic survey of published techniques for automated diagnosis and prognosis of COVID-19 diseases using medical imaging, assessing the validity of reported performance and investigating the proposed clinical use-case. To conduct a scoping review into the authors publishing such work. Methods: The Scopus database was queried and studies were screened for article type, and minimum source normalized impact per paper and citations, before manual relevance assessment and a bias assessment derived from a subset of the Checklist for Artificial Intelligence in Medical Imaging (CLAIM). The number of failures of the full CLAIM was adopted as a surrogate for risk-of-bias. Methodological and performance measurements were collected from each technique. Each study was assessed by one author. Comparisons were evaluated for significance with a two-sided independent t-test. Findings: Of 1002 studies identified, 390 remained after screening and 81 after relevance and bias exclusion. The ratio of exclusion for bias was 71%, indicative of a high level of bias in the field. The mean number of CLAIM failures per study was 8.3 ± 3.9 [1,17] (mean ± standard deviation [min,max]). 58% of methods performed diagnosis versus 31% prognosis. Of the diagnostic methods, 38% differentiated COVID-19 from healthy controls. For diagnostic techniques, area under the receiver operating characteristic (ROC) curve (AUC) = 0.924 ± 0.074 [0.810,0.991] and accuracy = 91.7% ± 6.4 [79.0,99.0]. For prognostic techniques, AUC = 0.836 ± 0.126 [0.605,0.980] and accuracy = 78.4% ± 9.4 [62.5,98.0]. CLAIM failures did not correlate with performance, providing confidence that the highest results were not driven by biased papers. Deep learning techniques reported higher AUC (p < 0.05) and accuracy (p < 0.05), but no difference in CLAIM failures was identified.
Interpretation: A majority of papers focus on the less clinically impactful diagnosis task, contrasted with prognosis, with a significant portion performing a clinically unnecessary task of differentiating COVID-19 from healthy. Authors should consider the clinical scenario in which their work would be deployed when developing techniques. Nevertheless, studies report superb performance in a potentially impactful application. Future work is warranted in translating techniques into clinical tools.
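The two-sided independent t-test used for the AUC and accuracy comparisons can be sketched as a Welch t statistic; the group values below are invented for illustration and are not the survey's data:

```python
import math

def welch_t(a, b):
    """Two-sided independent (Welch) t statistic and degrees of
    freedom for comparing two groups of reported AUC values."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    # Welch-Satterthwaite degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical AUCs for deep learning vs. other methods.
dl = [0.95, 0.93, 0.96, 0.91, 0.94]
other = [0.88, 0.90, 0.86, 0.89, 0.87]
t, df = welch_t(dl, other)
print(round(t, 2), round(df, 1))
```

In practice `scipy.stats.ttest_ind(dl, other, equal_var=False)` yields the same statistic along with a p-value; Welch's variant is the safer default when the two groups may have unequal variances.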
Supplementary Information
The online version contains supplementary material available at 10.1007/s13246-021-01093-0.