Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the limited availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high-dimensional and often categorical. These characteristics pose multiple modeling challenges. Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed. Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, comprising over 360,000 individual cases. Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of synthetic medical data.
has been administering a national cervical cancer screening program since 1992 by coordinating triennial cytology screening exams for the female population between 25 and 69 years of age. Up to 80% of cancers are prevented through mass screening, but this comes at the expense of considerable screening activity and leads to overtreatment of clinically asymptomatic precancers. In this article, we present a continuous-time, time-inhomogeneous hidden Markov model which was developed to understand the screening process and cervical carcinogenesis in detail. By leveraging the multivariate time series of medical exams of 1.7 million individuals, performed over a 25-year period, we simultaneously estimate all model parameters. We show that an age-dependent model reflects the Norwegian screening program by comparing empirical survival curves from observed registry data and data simulated from the proposed model. The model can be generalized to include more detailed individual-level covariates as well as new types of screening exams. By utilizing individual screening histories and covariate data, the proposed model shows potential for improving strategies for cancer screening programs by personalizing recommended screening intervals.
The combination of machine learning (ML) and electronic health records (EHR) data may be able to improve outcomes of hospitalized COVID-19 patients through improved risk stratification and patient outcome prediction. However, in resource-constrained environments the clinical utility of such data-driven predictive tools may be limited by the cost or unavailability of certain laboratory tests. We leveraged EHR data to develop an ML-based tool for predicting adverse outcomes that optimizes clinical utility under a given cost structure. We further gained insights into the decision-making process of the ML models through an explainable AI tool. This cohort study was performed using deidentified EHR data from COVID-19 patients from ProMedica Health System in northwest Ohio and southeastern Michigan. We tested the performance of various ML approaches for predicting either increasing ventilatory support or mortality. We performed post hoc analysis to obtain optimal feature sets under various budget constraints. We demonstrate that it is possible to achieve a significant reduction in cost at the expense of a small reduction in predictive performance. For example, when predicting ventilation, it is possible to achieve a 43% reduction in cost with only a 3% reduction in performance. Similarly, when predicting mortality, it is possible to achieve a 50% reduction in cost with only a 1% reduction in performance. This study presents a quick, accurate, and cost-effective method to evaluate risk of deterioration for patients with SARS-CoV-2 infection at the time of clinical evaluation.
Background: Machine learning (ML)-based risk stratification models built on electronic health records (EHR) data may help to optimize treatment of COVID-19 patients, but are often limited by their lack of clinical interpretability and by the cost of laboratory tests. We develop an ML-based tool for predicting adverse outcomes based on EHR data to optimize clinical utility under a given cost structure. This cohort study was performed using deidentified EHR data from COVID-19 patients from ProMedica Healthcare in northwest Ohio and southeastern Michigan. Methods: We tested the performance of various ML approaches for predicting either increasing ventilatory support or mortality. The set of model features under a budget constraint was optimized via exhaustive search across all combinations of features. Results: The optimal sets of features for predicting ventilation under any budget constraint included demographics and comorbidities (DCM), basic metabolic panel (BMP), D-dimer, lactate dehydrogenase (LDH), erythrocyte sedimentation rate (ESR), C-reactive protein (CRP), brain natriuretic peptide (BNP), and procalcitonin. For mortality, they included DCM, BMP, complete blood count, D-dimer, LDH, CRP, BNP, procalcitonin, and ferritin. Conclusions: This study presents a quick, accurate, and cost-effective method to evaluate risk of deterioration for patients with SARS-CoV-2 infection at the time of clinical evaluation.
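The exhaustive search over feature combinations under a budget constraint can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `best_feature_set`, the per-group costs, and the toy scoring function are all hypothetical placeholders. In a real study the score would presumably be a validation metric of a model trained on the selected features.

```python
from itertools import combinations

def best_feature_set(costs, score_fn, budget):
    """Try every subset of feature groups whose total cost fits the budget
    and return the (subset, score) pair with the highest score."""
    groups = sorted(costs)
    best_subset, best_score = (), score_fn(())
    for k in range(1, len(groups) + 1):
        for subset in combinations(groups, k):
            total_cost = sum(costs[g] for g in subset)
            if total_cost > budget:
                continue  # over budget, skip this combination
            score = score_fn(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Hypothetical costs per feature group (arbitrary units, NOT from the study).
COSTS = {"DCM": 0, "BMP": 10, "CBC": 8, "D-dimer": 15, "LDH": 7, "CRP": 6}

# Hypothetical performance gains; a stand-in for training and validating a model.
GAINS = {"DCM": 0.10, "BMP": 0.08, "CBC": 0.03,
         "D-dimer": 0.06, "LDH": 0.05, "CRP": 0.05}

def toy_score(subset):
    """Placeholder score: a 0.5 baseline plus an additive gain per group."""
    return 0.5 + sum(GAINS[g] for g in subset)

if __name__ == "__main__":
    for budget in (0, 25, 46):
        subset, score = best_feature_set(COSTS, toy_score, budget)
        print(budget, subset, round(score, 2))
```

The search visits all 2^n subsets, so it is only practical when features are bundled into a modest number of groups (such as lab panels) rather than treated individually.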