In a standard classification framework, a set of trustworthy learning data is employed to build a decision rule, with the final aim of classifying unlabelled units belonging to the test set. Unreliable labelled observations, namely outliers and data with incorrect labels, can therefore strongly undermine classifier performance, especially if the training set is small. The present work introduces a robust modification to the Model-Based Classification framework, employing impartial trimming and constraints on the ratio between the maximum and the minimum eigenvalue of the group scatter matrices. The proposed method effectively handles noise in both response and explanatory variables, providing reliable classification even when dealing with contaminated datasets. A robust information criterion is proposed for model selection. Experiments on real and simulated data, artificially adulterated, are provided to underline the benefits of the proposed method.
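The two robustness ingredients named in this abstract, impartial trimming and an eigenvalue-ratio constraint, can be illustrated with a minimal NumPy sketch. It is not the authors' estimation procedure. The function names `constrain_eigenvalues` and `impartial_trim`, the parameters `c` and `alpha`, and the simplified choice of eigenvalue floor (largest eigenvalue divided by `c`, rather than an optimally chosen threshold) are all assumptions made for illustration.

```python
import numpy as np


def constrain_eigenvalues(covs, c):
    """Clip the eigenvalues of all group scatter matrices so that the ratio
    between the largest and smallest eigenvalue across groups is at most c.

    Assumption: the floor is simply (largest eigenvalue) / c; a full
    implementation would choose the truncation threshold optimally.
    """
    decomps = [np.linalg.eigh(S) for S in covs]
    top = max(vals.max() for vals, _ in decomps)
    floor = top / c
    # Rebuild each matrix as V diag(clipped eigenvalues) V^T.
    return [(vecs * np.clip(vals, floor, top)) @ vecs.T for vals, vecs in decomps]


def impartial_trim(log_density, alpha):
    """Return a boolean mask that discards the floor(n * alpha) observations
    with the lowest log-density under the current model fit."""
    log_density = np.asarray(log_density, dtype=float)
    n = log_density.shape[0]
    n_trim = int(np.floor(n * alpha))
    keep = np.ones(n, dtype=bool)
    keep[np.argsort(log_density)[:n_trim]] = False
    return keep
```

In an iterative fit, one would typically alternate the two steps: trim the least plausible observations, re-estimate the group parameters on the retained units, and then enforce the eigenvalue constraint before the next iteration.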
Background
Recent evidence highlights the epidemiological value of blood DNA methylation (DNAm) as a surrogate biomarker for exposure to risk factors for non-communicable diseases (NCDs). In many cases, DNAm surrogates of exposures predict disease and longevity better than self-reported or measured exposures. Consequently, disease prediction models based on blood DNAm surrogates may outperform current state-of-the-art prediction models. This study aims to develop novel DNAm surrogates for cardiovascular disease (CVD) risk factors and a composite biomarker predictive of CVD risk. We compared the prediction performance of our newly developed risk score with the state-of-the-art DNAm risk scores for cardiovascular diseases, the ‘next-generation’ epigenetic clock DNAmGrimAge, and SCORE2, the prediction model based on traditional risk factors.
Results
Using data from the EPIC Italy cohort, we derived novel DNAm surrogates for BMI, blood pressure, fasting glucose and insulin, cholesterol, triglycerides, and coagulation biomarkers. We validated them in four independent data sets from Europe and the USA. Further, we derived a DNAmCVDscore predictive of the time-to-CVD event as a combination of several DNAm surrogates. ROC curve analyses show that DNAmCVDscore outperforms previously developed DNAm scores for CVD risk and SCORE2 for short-term CVD risk. Interestingly, the performance of DNAmGrimAge and DNAmCVDscore was comparable (slightly lower for DNAmGrimAge, although the differences were not statistically significant).
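As a rough illustration of how two risk scores can be compared by ROC analysis, the sketch below computes the area under the ROC curve through its pairwise-ranking definition. It assumes a binary event indicator at a fixed follow-up horizon, which simplifies the time-to-event setting described above. The function name `auc` and its inputs are hypothetical and do not come from the study.

```python
import numpy as np


def auc(scores, events):
    """Area under the ROC curve: the probability that a randomly chosen
    case has a higher score than a randomly chosen non-case, with ties
    counted as one half."""
    scores = np.asarray(scores, dtype=float)
    events = np.asarray(events, dtype=bool)
    pos = scores[events]
    neg = scores[~events]
    # Compare every case score with every non-case score.
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)
```

Computing this value for each candidate score on the same individuals gives a simple basis for comparison; a formal comparison would also need a test for the difference between the resulting AUCs.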
Conclusions
We described novel DNAm surrogates for CVD risk factors that are useful for future molecular epidemiology research, together with a blood DNAm-based composite biomarker, DNAmCVDscore, predictive of short-term cardiovascular events. Our results highlight the usefulness of DNAm surrogate biomarkers of risk factors in epigenetic epidemiology for identifying high-risk populations. In addition, we provide further evidence on the effectiveness of prediction models based on DNAm surrogates and discuss methodological aspects for further improvement. Finally, our results encourage testing this approach for other NCDs by training and developing DNAm surrogates for disease-specific risk factors and exposures.