Abstract: In studies that use electronic health record data, imputation of important data elements such as glycated hemoglobin (A1c) has become common. However, few studies have systematically examined the validity of various imputation strategies for missing A1c values. We derived a complete dataset using an incident diabetes population that has no missing values in A1c, fasting and random plasma glucose (FPG and RPG), age, and gender. We then created missing A1c values under two assumptions: missing completely at rand…
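As a rough illustration of the step described in the abstract above (starting from complete data and then deleting A1c values on purpose), the following minimal sketch masks a fixed fraction of values uniformly at random. The function name `mask_mcar`, the 30% fraction, and the sample values are all hypothetical; the abstract is truncated, so the study's actual missingness mechanisms and proportions are not known from this page.

```python
import random

def mask_mcar(values, fraction, seed=0):
    """Return a copy of `values` with a random `fraction` replaced by None.

    Every position has the same chance of being masked, independent of
    the value itself or of any other variable (hypothetical helper).
    """
    rng = random.Random(seed)
    n_missing = round(len(values) * fraction)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

# Invented A1c values (%), for illustration only.
complete_a1c = [6.5, 7.1, 8.0, 5.9, 6.2, 7.7, 9.1, 6.8, 7.4, 8.3]
masked_a1c = mask_mcar(complete_a1c, fraction=0.3)
```

Because the true values are retained in `complete_a1c`, any imputation method applied to `masked_a1c` can be scored against them.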
“…Baseline status for comorbid conditions was assigned based on two or more outpatient diagnosis codes or one or more inpatient diagnosis codes on or before the cohort entry date for chronic kidney disease (CKD; ICD-9 code 585.xx), CVD (ICD-9 codes 410–414.xx and 429.2), heart failure (HF; ICD-9 codes 428–428.9), hemorrhagic stroke (ICD-9 codes 430–432.9), ischemic stroke (ICD-9 codes 433–434.91), and transient ischemic attack (ICD-9 code 435.xx). Multiple imputation was used for missing data on A1C, LDL-C, and HDL-C following previous work using the SUPREME-DM cohort (12). …”
OBJECTIVE
The objective of this study was to assess the incidence of major cardiovascular (CV) hospitalization events and all-cause deaths among adults with diabetes with or without CV disease (CVD) associated with inadequately controlled glycated hemoglobin (A1C), high LDL cholesterol (LDL-C), high blood pressure (BP), and current smoking.
RESEARCH DESIGN AND METHODS
Study subjects included 859,617 adults with diabetes enrolled for more than 6 months during 2005–2011 in a network of 11 U.S. integrated health care organizations. Inadequate risk factor control was classified as LDL-C ≥100 mg/dL, A1C ≥7% (53 mmol/mol), BP ≥140/90 mm Hg, or smoking. Major CV events were based on primary hospital discharge diagnoses for myocardial infarction (MI) and acute coronary syndrome (ACS), stroke, or heart failure (HF). Five-year incidence rates, rate ratios, and average attributable fractions were estimated using multivariable Poisson regression models.
RESULTS
Mean (SD) age at baseline was 59 (14) years; 48% of subjects were female, 45% were white, and 31% had CVD. Mean follow-up was 59 months. Event rates per 100 person-years for adults with diabetes and CVD versus those without CVD were 6.0 vs. 1.7 for MI/ACS, 5.3 vs. 1.5 for stroke, 8.4 vs. 1.2 for HF, 18.1 vs. 4.0 for all CV events, and 23.5 vs. 5.0 for all-cause mortality. The percentages of CV events and deaths associated with inadequate risk factor control were 11% and 3%, respectively, for those with CVD and 34% and 7%, respectively, for those without CVD.
CONCLUSIONS
Additional attention to traditional CV risk factors could yield further substantive reductions in CV events and mortality in adults with diabetes.
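The baseline comorbidity rule quoted in the citation statement above (two or more outpatient diagnosis codes, or one or more inpatient diagnosis codes, on or before the cohort entry date) can be sketched as follows. This is a minimal illustration, not the study's code: the record layout, the function name `baseline_conditions`, and the prefix matching are assumptions, and only three of the listed conditions are included.

```python
from datetime import date

# Hypothetical ICD-9 prefix map for three conditions named in the text.
CONDITION_PREFIXES = {"CKD": ("585",), "HF": ("428",), "TIA": ("435",)}

def baseline_conditions(diagnoses, entry_date):
    """Flag each condition as present at baseline.

    `diagnoses` is an iterable of (icd9_code, setting, dx_date) tuples,
    where setting is "outpatient" or "inpatient" (assumed layout).
    """
    counts = {name: {"outpatient": 0, "inpatient": 0}
              for name in CONDITION_PREFIXES}
    for code, setting, dx_date in diagnoses:
        if dx_date > entry_date:
            continue  # only codes on or before cohort entry count
        for name, prefixes in CONDITION_PREFIXES.items():
            if code.startswith(prefixes):
                counts[name][setting] += 1
    # Present if >= 1 inpatient code or >= 2 outpatient codes.
    return {name: c["inpatient"] >= 1 or c["outpatient"] >= 2
            for name, c in counts.items()}

# Invented records for one patient, for illustration only.
records = [
    ("585.3", "outpatient", date(2005, 1, 10)),
    ("585.3", "outpatient", date(2005, 3, 2)),
    ("428.0", "outpatient", date(2005, 2, 1)),
    ("435.9", "inpatient", date(2006, 1, 5)),  # after cohort entry
]
flags = baseline_conditions(records, entry_date=date(2005, 6, 1))
```

In this invented example only CKD is flagged: HF has a single outpatient code, and the TIA admission falls after the entry date.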
“…An account of available software-facilitated modelling using MI in diabetes studies is given in ref. [17]. Despite being regarded as "state of the art", EM and MI techniques are computationally very intensive, especially MI, which is rather a statistical experiment featuring an imputation method. Apart from the design, the biggest contributor to the problem is the multitude of model parameters, as their number depends on the number of problem dimensions and can grow explosively with model complexity.…”
Missing values may be present in data without undermining its use for diagnostic / classification purposes, but they compromise the application of readily available software. Surrogate entries can remedy the situation, although the outcome is generally unknown. Discretization of continuous attributes renders all data nominal and is helpful in dealing with missing values; in particular, no special handling is required for different attribute types. A number of classifiers exist or can be reformulated for this representation. Some classifiers can be reinvented as data completion methods. In this work the Decision Tree, Nearest Neighbour, and Naive Bayesian methods are demonstrated to have the required aptness. An approach is implemented whereby the entered missing values are not necessarily a close match of the true data; however, they are intended to cause the least hindrance for classification. The proposed techniques find their application particularly in medical diagnostics. Where clinical data represents a number of related conditions, taking the Cartesian product of class values of the underlying sub-problems allows narrowing down of the selection of missing value substitutes. Real-world data examples, some publicly available, are used for testing. The proposed and benchmark methods are compared by classifying the data before and after missing value imputation, indicating a significant improvement.
“…A study by Rose et al [18] discussed the correlation between RBS and HbA1c levels. Stanley et al [19] used a linear regression model for imputation of missing HbA1c data. Their model calculates HbA1c levels for patient records with missing HbA1c values as continuous and categorical values and uses 4 predictors extracted from an EHR system: RBS, FBS, along with age and gender, as predictors to calculate the level of HbA1c for a diabetic population.…”
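The regression-based imputation described in the citation statement above (predicting HbA1c from RBS, FBS, age, and gender) can be sketched with ordinary least squares. This is an assumption-laden illustration rather than the cited authors' model: the function names, the 0/1 coding of gender, and all data values are invented, and the example targets are generated from made-up coefficients so the fit can be checked.

```python
import numpy as np

def fit_a1c_model(rbs, fbs, age, male, a1c):
    """Least-squares fit of HbA1c on an intercept and four predictors."""
    X = np.column_stack([np.ones(len(a1c)), rbs, fbs, age, male])
    coef, *_ = np.linalg.lstsq(X, np.asarray(a1c, dtype=float), rcond=None)
    return coef

def impute_a1c(coef, rbs, fbs, age, male):
    """Predict HbA1c for records where it is missing."""
    X = np.column_stack([np.ones(len(rbs)), rbs, fbs, age, male])
    return X @ coef

# Invented complete records (glucose in mg/dL, age in years, male as 0/1).
rbs = [140, 200, 180, 250, 160, 220]
fbs = [100, 130, 110, 160, 120, 150]
age = [50, 60, 45, 70, 55, 65]
male = [1, 0, 1, 0, 0, 1]
# Targets built from made-up coefficients, not from any real population.
a1c = [2.0 + 0.02 * r + 0.01 * f + 0.005 * a + 0.1 * m
       for r, f, a, m in zip(rbs, fbs, age, male)]

coef = fit_a1c_model(rbs, fbs, age, male, a1c)
# One record with HbA1c missing: RBS 190, FBS 125, age 58, male.
imputed = impute_a1c(coef, [190], [125], [58], [1])
```

A categorical version, as mentioned in the statement, could be obtained by thresholding the continuous prediction; the cut-point would be another assumption.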
Background
Predicting the risk of glycated hemoglobin (HbA1c) elevation can help identify patients with the potential for developing serious chronic health problems, such as diabetes. Early preventive interventions based upon advanced predictive models using electronic health records data for identifying such patients can ultimately help provide better health outcomes.
Objective
Our study investigated the performance of predictive models to forecast HbA1c elevation levels by employing several machine learning models. We also examined the use of patient electronic health record longitudinal data in the performance of the predictive models. Explainable methods were employed to interpret the decisions made by the black box models.
Methods
This study employed multiple logistic regression, random forest, support vector machine, and logistic regression models, as well as a deep learning model (multilayer perceptron) to classify patients with normal (<5.7%) and elevated (≥5.7%) levels of HbA1c. We also integrated current visit data with historical (longitudinal) data from previous visits. Explainable machine learning methods were used to interrogate the models and provide an understanding of the reasons behind the decisions made by the models. All models were trained and tested using a large data set from Saudi Arabia with 18,844 unique patient records.
Results
The machine learning models achieved promising results for predicting current HbA1c elevation risk. When coupled with longitudinal data, the machine learning models outperformed the multiple logistic regression model used in the comparative study. The multilayer perceptron model achieved an area under the receiver operating characteristic curve of 83.22% when used with historical data. All models showed a close level of agreement on the contribution of random blood sugar and age variables with and without longitudinal data.
Conclusions
This study shows that machine learning models can provide promising results for the task of classifying current HbA1c levels as elevated (≥5.7%) or normal (<5.7%). Using patients’ longitudinal data improved the performance and affected the relative importance of the predictors used. The models showed results that are consistent with comparable studies.