2022
DOI: 10.1038/s41598-022-16514-7

Clinically focused multi-cohort benchmarking as a tool for external validation of artificial intelligence algorithm performance in basic chest radiography analysis

Abstract: Artificial intelligence (AI) algorithms evaluating [supine] chest radiographs ([S]CXRs) have increased remarkably in number recently. Since training and validation are often performed on subsets of the same overall dataset, external validation is mandatory to reproduce results and reveal potential training errors. We applied multi-cohort benchmarking to the publicly accessible (S)CXR-analyzing AI algorithm CheXNet, comprising three clinically relevant study cohorts which differ in patient positioning ([S]CXRs…
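The external-validation setup the abstract describes can be illustrated with a short sketch. CheXNet itself is a DenseNet-121 with 14 sigmoid outputs trained on ChestX-ray14; everything else below (checkpoint path, data loader, cohort handling, function names) is a hypothetical scaffold for per-cohort evaluation, not the authors' benchmarking code.

```python
# Minimal sketch of multi-cohort external validation for a CheXNet-style
# classifier. Assumes a DenseNet-121 checkpoint with 14 sigmoid outputs
# (the ChestX-ray14 label set); loaders and paths are hypothetical.
import torch
import torchvision
from sklearn.metrics import roc_auc_score

def load_chexnet(checkpoint_path: str) -> torch.nn.Module:
    """Build DenseNet-121 with a 14-way head, as in CheXNet."""
    model = torchvision.models.densenet121(weights=None)
    model.classifier = torch.nn.Linear(model.classifier.in_features, 14)
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    return model.eval()

@torch.no_grad()
def cohort_auc(model: torch.nn.Module, loader, label_idx: int) -> float:
    """AUC for one finding on one external cohort (e.g. supine vs. erect CXRs)."""
    scores, labels = [], []
    for images, targets in loader:  # loader yields (B, 3, H, W) image batches
        probs = torch.sigmoid(model(images))[:, label_idx]
        scores.extend(probs.tolist())
        labels.extend(targets[:, label_idx].tolist())
    return roc_auc_score(labels, scores)
```

Running `cohort_auc` separately per cohort is what makes the benchmark "multi-cohort": the same frozen model is scored against each positioning-specific test set.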

Cited by 6 publications (11 citation statements); references 39 publications.
“…Among the 63 studies, 56 studies identified pneumothorax on chest radiography [26–81], four studies on computed tomography [82–85], one study on ECG [86], one study used chest radiography and photography using a smartphone [87], and one study used chest radiography and tabular data [88]. Six studies developed and internally tuned DLs [37, 52, 63, 67, 74, 76], 25 studies also internally tested their DLs [32, 33, 35, 38, 40, 41, 43, 45, 47, 48, 50, 55, 60, 65, 69, 70, 73, 75, 79–83, 85, 86] and 32 studies externally tested the DLs [26–31, 34, 36, 39, 42, 44, 46, 49, 51, 53, 54, 56–59, 61, 62, 64, 66, 68, 71, 72, 77, 78, 84, 87, 88].…”
Section: Results (mentioning; confidence: 99%)
“…As for model development, to generate a reference standard for image labelling, 18 studies used expert consensus [27–33, 35–38, 49, 53–55, 71, 77, 83], two relied on the opinion of a single expert reader [76, 85], 16 used pre-existing radiological reports or other imaging modalities [34, 41, 43, 45, 46, 52, 60, 61, 67, 75, 78–82, 87], one study defined their reference standard as surgical confirmation (indicated for surgery) [86], 11 studies used mixed methods (any combination of the aforementioned) [40, 47, 48, 50, 51, 62, 63, 65, 69, 70, 73] and two studies did not report how their reference standard was generated [74, 88]. As for model testing, to generate a reference standard for image labelling, 26 studies used expert consensus [26–28, 30–33, 38, 39, 44, 51, 54–57, 61, 64, 66, 68, 71–73, 77, 80, 83, 84], two relied on the opinion of a single expert reader [58, 85], 11 used pre-exist…”
Section: Results (mentioning; confidence: 99%)
“…The next most comprehensive model, which was capable of detecting 72 findings, demonstrated an average AUC of 0.77 [11]. When compared with physician detection accuracy, the identified devices were typically found to be as accurate, or more accurate, than radiologist or non-radiologist clinicians [11, 43, 59, 63, 71, 74, 81–83, 88]. Taking this further, multiple studies demonstrated that use of well-trained and validated deep learning models can improve the clinical finding classification performance of clinicians when acting as a diagnostic assistance device [42, 43, 57, 62, 66, 74, 83, 87].…”
Section: Discussion (mentioning; confidence: 99%)
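For context, the "average AUC" quoted in that statement is typically the unweighted (macro) mean of per-finding AUCs. A minimal sketch, assuming hypothetical arrays `y_true` and `y_score` of shape (N, 72) holding binary labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: (N, 72) binary ground-truth labels; y_score: (N, 72) probabilities.
# Both are hypothetical stand-ins for a multi-finding test set.
aucs = [roc_auc_score(y_true[:, k], y_score[:, k]) for k in range(y_true.shape[1])]
print(f"macro-average AUC over {len(aucs)} findings: {np.mean(aucs):.2f}")
```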
“…Tam et al also reported the improved detection of suspicious pulmonary nodules on CXR with AI-aided interpretation (sensitivities 89–94%) versus unaided interpretation for all three radiologists (sensitivities 69–86%), with a slight increase in false positives and a decrease in specificity [18]. Another CXR study reported that standalone AI performance for pneumothorax, pleural effusion and lung lesions was similar to that for radiology residents, but was significantly better than the performance of non-radiology residents [19]. Beyond CXRs, other studies have reported on missed findings of intracranial hemorrhage in noncontrast head CT examinations and mammography [20].…”
Section: Introduction (mentioning; confidence: 99%)
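The trade-off that statement describes (more detections at the cost of more false positives) follows directly from the defining ratios of sensitivity and specificity. A minimal sketch with hypothetical confusion-matrix counts, not figures from the cited study:

```python
def sens_spec(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical reader counts: AI aid converts 5 misses into hits but adds
# 4 false positives, raising sensitivity while lowering specificity.
print(sens_spec(tp=69, fn=31, tn=180, fp=20))   # unaided:  (0.69, 0.90)
print(sens_spec(tp=74, fn=26, tn=176, fp=24))   # AI-aided: (0.74, 0.88)
```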