Public Covid-19 X-ray datasets and their impact on model bias – A systematic review of a significant problem

Cruz, Beatriz García Santa; Bossa, Matías; Sölter, Jan; Husch, Andreas

doi:10.1016/j.media.2021.102225

Cited by 50 publications

(31 citation statements)

References 82 publications

“…In Sheykhivand et al (2021) , generative adversarial network (GAN) is employed to generate CXR images of COVID-19 category, which has enlarged the sample capacity to nearly 7 times and facilitated more robust feature learning. Notice that the relatively less COVID-19 training samples can lead to model bias, and aiming at this phenomenon, work Garcia Santa Cruz et al (2021) has presented a systematic inspection on public COVID-19 X-ray imaging datasets and provided effective guidance accordingly.…”

Section: Introduction (mentioning)

confidence: 99%

Cov-Net: A computer-aided diagnosis method for recognizing COVID-19 from chest X-ray images via machine vision

Zeng, Wu, et al. 2022

Expert Systems with Applications

“…Many studies use data from sources with minimal provenance and metadata, and often use data that was not intended for training diagnostic or prognostic tools. A number of datasets aggregate data from different sources, some of which may be aggregates themselves [ 9 ]; and many studies aggregate a number of datasets, either to increase their training size or to provide an independent test set. However, this causes a complex set or participants and leads to a high risk that the same images are present in the training and evaluation set.…”

Section: Discussion (mentioning)

confidence: 99%
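The leakage risk described in this statement (the same image ending up in both training and evaluation sets after datasets are aggregated) can be checked mechanically. The following is a minimal sketch, not a method from any of the cited papers: it assumes images are available as raw bytes, and the function names and file identifiers are hypothetical. Exact-content hashing only catches byte-identical copies; re-encoded or resized duplicates would need a perceptual comparison instead.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Return a SHA-256 digest of the raw image bytes."""
    return hashlib.sha256(data).hexdigest()


def find_cross_split_duplicates(train, test):
    """Find byte-identical images shared between two splits.

    train, test: dicts mapping an image identifier to its raw bytes.
    Returns a sorted list of (train_id, test_id) pairs.
    """
    # Index the training split by content hash.
    by_hash = defaultdict(list)
    for image_id, data in train.items():
        by_hash[content_hash(data)].append(image_id)

    # Look up every test image in that index.
    leaks = []
    for image_id, data in test.items():
        for train_id in by_hash.get(content_hash(data), []):
            leaks.append((train_id, image_id))
    return sorted(leaks)


# Hypothetical toy data: one image appears in both aggregated sources.
train_images = {
    "datasetA/img001.png": b"\x00\x01\x02",
    "datasetA/img002.png": b"\x03\x04\x05",
}
test_images = {
    "datasetB/scan17.png": b"\x03\x04\x05",
    "datasetB/scan18.png": b"\x06\x07",
}

leaks = find_cross_split_duplicates(train_images, test_images)
print(leaks)
```

Any pair reported here indicates that the evaluation set is not independent of the training set for that image.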

“…Due to reports of a high risk-of-bias in the field [ 9 , 13 , 15 ], we include a bias assessment. Improper study design, data collection, data partitioning and statistical methods can lead to misleading reported results [ 14 ].…”

Section: Methods (mentioning)

confidence: 99%

“…[Figure captions: The percentage of total studies that failed each of the required subset of the CLAIM checklist for inclusion (left), and a histogram of the number of failures (right), where only studies with 0 failures met the inclusion criteria. Fig. 3: CLAIM results of studies included: the number of included studies that failed each of the CLAIM items (left), and a histogram of the number of failures (right).] …may be aggregates themselves [9]; and many studies aggregate a number of datasets, either to increase their training size or to provide an independent test set. However, this causes a complex set or participants and leads to a high risk that the same images are present in the training and evaluation set.…”

Section: Datasets (mentioning)

confidence: 99%

“…DeGrave et al [ 8 ] demonstrated that combining data from multiple sources, in particular where data from different classes have different acquisition and pre-processing parameters, led to a significant bias that artificially improved the measured performance in many studies. Garcia Santa Cruz et al [ 9 ] presented a review of public CXR datasets, concluding that the most popular datasets used in the literature were at a high risk of introducing bias into reported results.…”

Section: Introduction (mentioning)

confidence: 99%

Automated COVID-19 diagnosis and prognosis with medical imaging and who is publishing: a systematic review

et al. 2021

Objectives: To conduct a systematic survey of published techniques for automated diagnosis and prognosis of COVID-19 diseases using medical imaging, assessing the validity of reported performance and investigating the proposed clinical use-case. To conduct a scoping review into the authors publishing such work. Methods: The Scopus database was queried and studies were screened for article type, and minimum source normalized impact per paper and citations, before manual relevance assessment and a bias assessment derived from a subset of the Checklist for Artificial Intelligence in Medical Imaging (CLAIM). The number of failures of the full CLAIM was adopted as a surrogate for risk-of-bias. Methodological and performance measurements were collected from each technique. Each study was assessed by one author. Comparisons were evaluated for significance with a two-sided independent t-test. Findings: Of 1002 studies identified, 390 remained after screening and 81 after relevance and bias exclusion. The ratio of exclusion for bias was 71%, indicative of a high level of bias in the field. The mean number of CLAIM failures per study was 8.3 ± 3.9 [1,17] (mean ± standard deviation [min,max]). 58% of methods performed diagnosis versus 31% prognosis. Of the diagnostic methods, 38% differentiated COVID-19 from healthy controls. For diagnostic techniques, area under the receiver operating curve (AUC) = 0.924 ± 0.074 [0.810,0.991] and accuracy = 91.7% ± 6.4 [79.0,99.0]. For prognostic techniques, AUC = 0.836 ± 0.126 [0.605,0.980] and accuracy = 78.4% ± 9.4 [62.5,98.0]. CLAIM failures did not correlate with performance, providing confidence that the highest results were not driven by biased papers. Deep learning techniques reported higher AUC (p < 0.05) and accuracy (p < 0.05), but no difference in CLAIM failures was identified. 
Interpretation: A majority of papers focus on the less clinically impactful diagnosis task, contrasted with prognosis, with a significant portion performing a clinically unnecessary task of differentiating COVID-19 from healthy. Authors should consider the clinical scenario in which their work would be deployed when developing techniques. Nevertheless, studies report superb performance in a potentially impactful application. Future work is warranted in translating techniques into clinical tools. Supplementary Information The online version contains supplementary material available at 10.1007/s13246-021-01093-0.
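The abstract reports its summary statistics in the form mean ± standard deviation [min,max]. As a small illustration of how such a summary can be produced, the sketch below uses Python's standard `statistics` module. The input values are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the abstract does not say which estimator was used.

```python
import statistics


def summarize(values):
    """Format values as 'mean ± sd [min,max]' with one decimal place.

    Uses the sample standard deviation (n - 1 denominator); this is an
    assumption, not something stated in the abstract.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return f"{mean:.1f} ± {sd:.1f} [{min(values)},{max(values)}]"


# Hypothetical per-study checklist failure counts, for illustration only.
example_failures = [2, 4, 6]
print(summarize(example_failures))
```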

Automatic coronavirus disease 2019 diagnosis based on chest radiography and deep learning – Success story or dataset bias?

2022

Purpose: Over the last 2 years, the artificial intelligence (AI) community has presented several automatic screening tools for coronavirus disease 2019 based on chest radiography (CXR), with reported accuracies often well over 90%. However, it has been noted that many of these studies have likely suffered from dataset bias, leading to overly optimistic results. The purpose of this study was to thoroughly investigate to what extent biases have influenced the performance of a range of previously proposed and promising convolutional neural networks (CNNs), and to determine what performance can be expected with current CNNs on a realistic and unbiased dataset. Methods: Five CNNs for COVID-19 positive/negative classification were implemented for evaluation, namely VGG19, ResNet50, InceptionV3, DenseNet201, and COVID-Net. To perform both internal and cross-dataset evaluations, four datasets were created. The first dataset, Valencian Region Medical Image Bank (BIMCV), followed strict reverse transcriptase-polymerase chain reaction (RT-PCR) test criteria and was created from a single reliable open access databank, while the second dataset (COVIDxB8) was created through a combination of six online CXR repositories. The third and fourth datasets were created by combining the opposing classes from the BIMCV and COVIDxB8 datasets. To decrease inter-dataset variability, a pre-processing workflow of resizing, normalization, and histogram equalization was applied to all datasets. Classification performance was evaluated on unseen test sets using precision and recall. A qualitative sanity check was performed by evaluating saliency maps displaying the top 5%, 10%, and 20% most salient segments in the input CXRs, to evaluate whether the CNNs were using relevant information for decision making.
In an additional experiment and to further investigate the origin of potential dataset bias, all pixel values outside the lungs were set to zero through automatic lung segmentation before training and testing. Results: When trained and evaluated on the single online source dataset (BIMCV), the performance of all CNNs is relatively low (precision: 0.65-0.72, recall: 0.59-0.71), but remains relatively consistent during external evaluation (precision: 0.58-0.82, recall: 0.57-0.72). On the contrary, when trained and internally evaluated on the combinatory datasets, all CNNs performed well across all metrics (precision: 0.94-1.00, recall: 0.77-1.00). However, when subsequently evaluated cross-dataset, results dropped substantially (precision: 0.10-0.61, recall: 0.04-0.80). For all datasets, saliency maps revealed the CNNs rarely…

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
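The abstract compares precision and recall between internal and cross-dataset test sets. For readers unfamiliar with the two metrics, here is a minimal sketch of how they are computed for a binary COVID-19 positive/negative task. The function name and the label vectors are hypothetical and do not reproduce any result from the study.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one positive class.

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    Returns 0.0 for a metric whose denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = COVID-19 positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0]
y_internal = [1, 1, 1, 0, 0, 0]  # predictions on an internal test split
y_external = [1, 0, 0, 1, 1, 0]  # predictions on a test set from another source

print("internal:", precision_recall(y_true, y_internal))
print("external:", precision_recall(y_true, y_external))
```

Running the same function on both an internal split and an independently sourced test set makes a drop in generalization visible as a gap between the two pairs of numbers.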
