Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set

Daneshjou, Roxana; Novoa, Roberto A.; Jenkins, Melissa; Liang, Wei; Rotemberg, Veronica; Ko, Justin; Swetter, Susan M.; Bailey, Elizabeth E.; Gevaert, Olivier; Mukherjee, Pritam; Phung, Michelle; Yekrang, Kiana; Fong, Bradley; Sahasrabudhe, Rachna; Allerup, Johan A. C.; Okata-Karigane, Utako; Zou, James; Chiou, Albert S.

doi:10.48550/arxiv.2203.08807

Cited by 1 publication

(1 citation statement)

References 0 publications


“…This list of dimensions largely depends on the context and the degree to which data are subjective, representative, and missing (Mullainathan & Obermeyer, 2017). Recent examples of important contextual dimensions on machine learning tasks include skin color in face recognition (Buolamwini & Gebru, 2018) and dermatology diagnosis (Groh et al, 2021; Daneshjou et al, 2022), background scenery for affect recognition (Kosti et al, 2019), number of people in a video for deepfake detection (Groh et al, 2022a), number of chronic illnesses for algorithmic healthcare risk prediction (Obermeyer et al, 2019), data artifacts like surgical markings (Winkler et al, 2019) or clinically irrelevant labels (Oakden-Rayner et al, 2020) for medical diagnosis classification, and patients' self reports of pain for quantifying severity of knee osteoarthritis (Pierson et al, 2021). Helpful questions that may guide the identification of potential context shifts in complex, human-centered machine learning applications include (and are not limited to): who are represented in the data and as annotators of the data, when and where is the data collected, how do social, geographical, temporal, technological, aesthetic, financial incentives and other idiosyncrasies influence the creation of the data, and why the data is curated as it is.…”

Section: Contextualizing the Benchmark-production Gap (mentioning)

confidence: 99%

Identifying the Context Shift between Test Benchmarks and Production Data

Groh

2022

Preprint


Across a wide variety of domains, there exists a performance gap between machine learning models' accuracy on dataset benchmarks and real-world production data. Despite the careful design of static dataset benchmarks to represent the real world, models often err when the data is out-of-distribution relative to the data the models have been trained on. We can directly measure and adjust for some aspects of distribution shift, but we cannot address sample selection bias, adversarial perturbations, and non-stationarity without knowing the data generation process. In this paper, we outline two methods for identifying changes in context that lead to distribution shifts and model prediction errors: leveraging human intuition and expert knowledge to identify first-order contexts and developing dynamic benchmarks based on desiderata for the data generation process. Furthermore, we present two case studies to highlight the implicit assumptions underlying applied machine learning models that tend to lead to errors when attempting to generalize beyond test benchmark datasets. By paying close attention to the role of context in each prediction task, researchers can reduce context shift errors and increase generalization performance.
