Deep learning models have reached or surpassed human-level performance in the field of medical imaging, especially in disease diagnosis using chest x-rays. However, prior work has found that such classifiers can exhibit biases in the form of gaps in predictive performance across protected groups. In this paper, we question whether striving to achieve zero disparities in predictive performance (i.e., group fairness) is the appropriate fairness definition in the clinical setting, as compared to minimax fairness, which focuses on maximizing the performance of the worst-case group. We benchmark the performance of nine methods in improving classifier fairness across these two definitions. We find, consistent with prior work on non-clinical data, that methods which strive to achieve better worst-group performance do not outperform simple data balancing. We also find that methods which achieve group fairness do so by worsening performance for all groups. In light of these results, we discuss the utility of fairness definitions in the clinical setting, advocating for an investigation of the bias-inducing mechanisms in the underlying data generating process whenever possible.
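The contrast between the two fairness definitions above can be made concrete with a small sketch. The per-group AUC values below are purely illustrative (not results from this paper): group fairness targets the *gap* between groups, while minimax fairness targets the *worst-group* performance.

```python
# Illustrative per-group AUC values for a classifier evaluated on three
# protected groups (hypothetical numbers, for exposition only).
per_group_auc = {"group_A": 0.86, "group_B": 0.81, "group_C": 0.84}

# Group fairness: the objective is to drive the performance gap
# across groups to zero.
gap = max(per_group_auc.values()) - min(per_group_auc.values())

# Minimax fairness: the objective is to maximize the performance
# of the worst-case group, regardless of the gap.
worst_group_auc = min(per_group_auc.values())

print(f"performance gap (group-fairness objective): {gap:.2f}")
print(f"worst-group AUC (minimax objective): {worst_group_auc:.2f}")
```

Note that a model can satisfy group fairness with uniformly poor performance (all groups at, say, 0.70 AUC yields a zero gap), which is precisely the failure mode the benchmark investigates.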
Data and Code Availability
We make use of two chest x-ray datasets: MIMIC-CXR (Johnson et al., 2019) and CheXpert (Irvin et al., 2019). Both datasets are publicly available pending appropriate data usage agreements. Demographic data for patients in MIMIC-CXR were obtained from MIMIC-IV (Johnson et al., 2021), available through PhysioNet (Goldberger et al., 2000). We analyze an additional radiologist-labelled dataset in this paper. We recruit a board-certified radiologist co-author to manually label 1,200 reports in MIMIC-CXR which have been labelled as No Finding by the CheXpert labeller, an automatic rule-based NLP model (Irvin et al., 2019). This dataset, along with code to reproduce our results, can be found at https://github.com/MLforHealth/CXR_Fairness.