2020
DOI: 10.1093/jamia/ocaa283

Addressing bias in prediction models by improving subpopulation calibration

Abstract: Objective: To illustrate the problem of subpopulation miscalibration, to adapt an algorithm for recalibration of the predictions, and to validate its performance. Materials and Methods: In this retrospective cohort study, we evaluated the calibration of predictions based on the Pooled Cohort Equations (PCE) and the fracture risk assessment tool (FRAX) in the overall population and in subpopulations defined by the intersection o…

Cited by 37 publications (37 citation statements) | References 27 publications
“…Differences in these two metrics have been employed in recent studies for bias analysis. 16,17 FNR quantifies the rate at which patients with the observed outcome of death were misclassified. Thus, a high FNR for the score may lead to an increase in undertreatment, and high DisparityFNR (in absolute value) highlights large differences in such undertreatment across groups.…”
Section: Discussion (mentioning)
confidence: 99%
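To make these two metrics concrete, here is a minimal Python sketch of the false negative rate and a group-level FNR disparity, assuming binary labels and hard predictions. The helper names (false_negative_rate, disparity_fnr) and the max-minus-min definition of the disparity are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP): the share of observed positives
    (e.g., patients who died) that the model classified as negative."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    positives = y_true == 1
    if positives.sum() == 0:
        return np.nan  # FNR is undefined when no positives are observed
    return float(np.mean(y_pred[positives] == 0))

def disparity_fnr(y_true, y_pred, groups):
    """Spread of per-group FNRs (max minus min); a large value flags
    uneven undertreatment risk across groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    fnrs = {g: false_negative_rate(y_true[groups == g], y_pred[groups == g])
            for g in np.unique(groups)}
    vals = [v for v in fnrs.values() if not np.isnan(v)]
    return fnrs, max(vals) - min(vals)

# Example: group "B" misses half its true positives, group "A" a third
y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 1])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
per_group, gap = disparity_fnr(y_true, y_pred, groups)
print(per_group, gap)  # {'A': 0.333..., 'B': 0.5}, gap ≈ 0.167
```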
“…Hence, in assessment of generalizability, we add another set of measures to our analyses which we refer to as "fairness" metrics, following the algorithmic fairness literature. [15][16][17] Such performance checks are important, especially given evidence on racial bias in medical decision-making tools. 4,13,14 The primary objective of this study is to evaluate the external validity of predictive models for clinical decision making across hospitals and geographies in terms of the metrics: predictive discrimination (area under the receiver operating characteristic curve), calibration (calibration slope), 18 and algorithmic fairness (disparity in false negative rates and disparity in calibration slopes).…”
Section: Introduction (mentioning)
confidence: 99%
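The metrics named in this excerpt can be computed per group with standard tools. The sketch below assumes scikit-learn (LogisticRegression(penalty=None) requires version 1.2+) and estimates the calibration slope by an unpenalized logistic refit of the outcome on the logit of the predicted risk; the function names and the max-minus-min disparity summary are illustrative assumptions, not the cited study's exact procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def calibration_slope(y_true, p_pred, eps=1e-6):
    """Slope from refitting y on logit(p): 1.0 is ideal; <1 suggests
    predicted risks are too extreme, >1 too compressed."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    logit = np.log(p / (1 - p)).reshape(-1, 1)
    model = LogisticRegression(penalty=None).fit(logit, y_true)  # unpenalized refit
    return float(model.coef_[0, 0])

def per_group_report(y_true, p_pred, groups):
    """AUROC and calibration slope per group, plus a max-minus-min
    disparity in calibration slopes across groups."""
    y_true, p_pred, groups = map(np.asarray, (y_true, p_pred, groups))
    rows = {}
    for g in np.unique(groups):
        m = groups == g
        rows[g] = {"auroc": roc_auc_score(y_true[m], p_pred[m]),
                   "cal_slope": calibration_slope(y_true[m], p_pred[m])}
    slopes = [r["cal_slope"] for r in rows.values()]
    return rows, max(slopes) - min(slopes)
```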
“…Fairness has been defined in various ways depending on context and application; two of the most widely leveraged definitions for bias detection and correction are Equal Opportunity, which requires predictions to have equal true positive rates across two demographics, and Equalized Odds, which places an additional constraint on the predictor to have equal false positive rates 43 . To derive fair decisions with machine learning algorithms, three categories of approaches have been proposed to mitigate biases 42,44 : 1) Pre-processing: the original dataset is transformed so that the underlying discrimination towards some groups is removed 45 ; 2) In-processing: either adding a penalization term to the objective function 46 or imposing a fairness-relevant constraint 47 ; 3) Post-processing: recomputing the predictors' outputs to improve fairness 48 .…”
Section: Bias and Fairness in Machine Learning (mentioning)
confidence: 99%
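A minimal check of these two criteria, assuming a binary classifier with hard predictions: Equal Opportunity holds approximately when the TPR gap is near zero, and Equalized Odds additionally requires the FPR gap to be near zero. The helper name and the gap summaries below are illustrative assumptions, not from the cited references:

```python
import numpy as np

def equalized_odds_gaps(y_true, y_pred, groups):
    """Per-group TPR and FPR, plus max-minus-min gaps across groups.
    TPR gap ~ 0 approximates Equal Opportunity; both gaps ~ 0
    approximate Equalized Odds."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    rates = {}
    for g in np.unique(groups):
        m = groups == g
        yt, yp = y_true[m], y_pred[m]
        tpr = float(np.mean(yp[yt == 1] == 1)) if (yt == 1).any() else np.nan
        fpr = float(np.mean(yp[yt == 0] == 1)) if (yt == 0).any() else np.nan
        rates[g] = {"TPR": tpr, "FPR": fpr}
    tprs = [r["TPR"] for r in rates.values()]
    fprs = [r["FPR"] for r in rates.values()]
    return rates, {"TPR_gap": max(tprs) - min(tprs),
                   "FPR_gap": max(fprs) - min(fprs)}
```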
“…5 Some potential causes of this chasm are that current models are not useful, 4,6,7 reliable, 8,9 or fair. [10][11][12][13][14][15][16][17][18] Nevertheless, predictive models have been deployed in healthcare settings without transparency or independent validation, 19,20 and their subsequent failures have been met with public outcry. 2,[21][22][23] Adhering to model reporting guidelines is one way to improve the usefulness, [24][25][26][27][28] fairness, 29,30 and reliability 27,[31][32][33][34] of clinical predictive models.…”
Section: Introduction (mentioning)
confidence: 99%