2021
DOI: 10.1101/2021.04.07.21254975
Preprint

On evaluation metrics for medical applications of artificial intelligence

Abstract: Clinicians and model developers need to understand how proposed machine learning (ML) models could improve patient care. However, no single metric captures all the desirable properties of a model, and several metrics are typically reported to summarize a model's performance. Unfortunately, these measures are not easily understandable by many clinicians. Moreover, comparing models across studies in an objective manner is challenging, and no tool exists to compare models using the same performance metrics. …


Cited by 34 publications (31 citation statements): 1 supporting, 30 mentioning, 0 contrasting. References 19 publications (32 reference statements). Citing publications span 2021 to 2024.
“…This is consistent with another surprising finding that ABNN prospective prediction sensitivity (recall) and ABNN prospective prediction precision are strongly and positively correlated across validation subgroups (Exact Pearson r = 0.808; Fig. 3E), in stark contrast to the typical inverse relationship between precision and recall for binary classifications (20). This finding reveals that ABNN performance is primarily driven by improved disease understanding that unbiasedly reduces both false positives and false negatives, rather than by biased cutoff tuning that trades off false positives against false negatives.…”
Section: Results (supporting)
confidence: 91%
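As a hedged illustration of the subgroup-level analysis this excerpt describes, the sketch below computes precision and recall per validation subgroup and then their Pearson correlation. The data and the precision_recall helper are hypothetical stand-ins, not code from the cited ABNN study:

import numpy as np

def precision_recall(y_true, y_pred):
    # Confusion-matrix counts for binary labels (1 = positive class).
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical validation subgroups: (labels, predictions) pairs.
rng = np.random.default_rng(0)
subgroups = [(rng.integers(0, 2, 100), rng.integers(0, 2, 100)) for _ in range(8)]

pairs = np.array([precision_recall(y, p) for y, p in subgroups])
r = np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1]  # Pearson r across subgroups
print(f"Pearson r between subgroup precision and recall: {r:.3f}")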
“…Accordingly, we chose to report prospective prediction sensitivity (recall) in the context of F1 score, and prospective prediction precision against the expert benchmark as the most appropriate performance metrics for PROTOCOLS validation test (19) to reflect the descending priorities (maximize true positives > minimize false negatives > minimize false positives > maximize true negatives) in actual clinical settings to deliver maximal clinical benefit to patients at minimal clinical cost to sponsors (Methods section ‘Metrics definition’ and ‘Model training’) (20, 21). We also report other standard measures (accuracy, specificity, etc.)…”
(mentioning)
confidence: 99%
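For reference, the textbook definitions behind the metrics named in this excerpt (standard formulas, not reproduced from the cited methods section), in LaTeX:

\mathrm{Sensitivity\ (Recall)} = \frac{TP}{TP + FN}, \qquad
\mathrm{Precision\ (PPV)} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}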
“…In minority-event detection, metrics such as sensitivity and specificity (i.e. true positive and true negative rates) are often preferred depending on the relative importance of Type I and Type II errors in the given medical context [38]. To simplify presentation of results, we report the Balanced Accuracy, an average of specificity and sensitivity.…”
Section: B Ontology of Methods (mentioning)
confidence: 99%
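A minimal sketch of the balanced accuracy the excerpt reports, i.e. the mean of sensitivity and specificity; the toy arrays below are hypothetical, chosen only to show its behavior under class imbalance:

import numpy as np

def balanced_accuracy(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return (sensitivity + specificity) / 2.0

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])  # imbalanced classes
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0])
print(balanced_accuracy(y_true, y_pred))  # (2/3 + 6/7) / 2 ≈ 0.762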
“…48, reporting of a single metric such as Sensitivity, PPV or Specificity can be highly misleading because, for example, non-informative classifiers can achieve high values on imbalanced classes. The F1 Score (also known as DSC in the context of segmentation) overcomes this issue by representing the harmonic mean of PPV and Sensitivity and therefore penalizing extreme values of either metric [38], while being relatively robust against imbalanced data sets [89]. The F1 Score is a specification of the Fβ score, which adds a weighting between PPV and Sensitivity.…”
Section: G1 Image-level Classification (mentioning)
confidence: 99%
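For reference, the standard Fβ definition this excerpt alludes to (a textbook formula, not quoted from the cited work): β > 1 weights Sensitivity (Recall) more heavily, β < 1 weights PPV more heavily, and β = 1 recovers the F1 Score. In LaTeX:

F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{PPV} \cdot \mathrm{Sensitivity}}{\beta^2 \cdot \mathrm{PPV} + \mathrm{Sensitivity}}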