Aims
Electronic health records (EHRs) often contain abundant but irregularly timed longitudinal measurements of risk factors. We aimed to leverage such data to improve risk prediction of atherosclerotic cardiovascular disease (ASCVD) using machine learning algorithms, which could in turn enable automated screening of the population.
Methods and results
A total of 215,744 Chinese adults aged 40-79 years without a history of CVD were included from an EHR-based longitudinal cohort study (6,081 cases). To keep the models interpretable, the predictors comprised demographic characteristics, medication treatment, and repeatedly measured records of lipids, glycemia, obesity, blood pressure, and renal function. The primary outcome was ASCVD, defined as non-fatal acute myocardial infarction, coronary heart disease death, or fatal or non-fatal stroke. eXtreme Gradient Boosting (XGBoost) and LASSO regression models were derived to predict 5-year ASCVD risk. In the validation set, the XGBoost model had the highest C-statistic (0.792), significantly exceeding that of the refitted Chinese guideline-recommended Cox model, i.e., the China-PAR (difference in C-statistic: 0.011, 0.006-0.017; P<0.001); results were similar for LASSO regression (difference in C-statistic: 0.008, 0.005-0.011; P<0.001). The XGBoost model also showed the best calibration (Men: Dx=0.598, P=0.75; Women: Dx=1.867, P=0.08). Moreover, the risk distributions predicted by the machine learning algorithms differed from that of the conventional model. The net reclassification improvements (NRIs) of XGBoost and LASSO over the Cox model were 3.9% (1.4%-6.4%) and 2.8% (0.7%-4.9%), respectively.
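As an illustration of how an NRI of this kind can be read, the following minimal sketch computes a two-category NRI at a single high-risk threshold. It is not the study's implementation: the function name `categorical_nri`, the 10% threshold, and the toy risk values are all assumptions made for illustration, and the sketch ignores censoring, which an analysis of 5-year risk would normally need to handle.

```python
def categorical_nri(old_risk, new_risk, event, threshold=0.10):
    """Two-category net reclassification improvement at one risk threshold.

    old_risk / new_risk: predicted risks from the reference and the new model.
    event: 1 if the person had the outcome, 0 otherwise.
    threshold: hypothetical high-risk cut-off (an assumption, not from the study).
    """
    up_event = down_event = n_event = 0
    up_nonevent = down_nonevent = n_nonevent = 0
    for old, new, y in zip(old_risk, new_risk, event):
        moved_up = new >= threshold and old < threshold
        moved_down = new < threshold and old >= threshold
        if y:
            n_event += 1
            up_event += moved_up
            down_event += moved_down
        else:
            n_nonevent += 1
            up_nonevent += moved_up
            down_nonevent += moved_down
    # Events should move up; non-events should move down.
    nri_event = (up_event - down_event) / n_event
    nri_nonevent = (down_nonevent - up_nonevent) / n_nonevent
    return nri_event + nri_nonevent


# Toy data (fabricated purely to exercise the function).
old = [0.05, 0.12, 0.08, 0.15, 0.04, 0.11]
new = [0.12, 0.14, 0.06, 0.07, 0.03, 0.13]
event = [1, 1, 0, 0, 0, 1]

nri = categorical_nri(old, new, event)
print(f"NRI = {nri:.3f}")
```

In this toy example one of three events is correctly moved above the threshold and one of three non-events is correctly moved below it, so both components are positive. A positive NRI means the new model, on balance, reclassifies people in the right direction relative to the reference model.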
Conclusions
Machine learning algorithms applied to irregular, repeatedly measured real-world data could improve cardiovascular risk prediction. They showed significantly better reclassification performance, identifying the high-risk population more accurately.