Smoking-induced noncommunicable diseases (SiNCDs) have become a significant threat to public health and a significant cause of death globally. In the last decade, numerous studies have proposed artificial intelligence techniques to predict the risk of developing SiNCDs. However, determining the most significant features and developing interpretable models remain challenging in such systems. In this study, we propose an efficient extreme gradient boosting (XGBoost) based framework combined with a hybrid feature selection (HFS) method for SiNCDs prediction among the general population in South Korea and the United States. First, HFS is performed in three stages: (I) significant features are selected by t-test and chi-square test; (II) multicollinearity analysis is used to obtain dissimilar features; (III) the final selection of the best representative features is based on the least absolute shrinkage and selection operator (LASSO). Then, the selected features are fed into the XGBoost predictive model. The experimental results show that our proposed model outperforms several existing baseline models. In addition, the proposed model identifies important features, which enhances the interpretability of the SiNCDs prediction model. Consequently, the XGBoost-based framework is expected to contribute to the early diagnosis and prevention of SiNCDs in public health.
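The first two HFS stages described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a threshold on the absolute Welch t statistic instead of a p-value, uses pairwise Pearson correlation as the multicollinearity check, and omits the chi-square test and the LASSO stage. The feature names, the toy data, and the thresholds `t_min` and `r_max` are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


def filter_features(columns, labels, t_min=2.0, r_max=0.9):
    """Stage I: keep features with |t| >= t_min (assumed threshold).
    Stage II: greedily drop features that correlate strongly with a kept one."""
    scored = []
    for name, values in columns.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        t = abs(welch_t(pos, neg))
        if t >= t_min:
            scored.append((t, name))
    kept = []
    for _, name in sorted(scored, reverse=True):  # strongest feature first
        if all(abs(pearson(columns[name], columns[k])) < r_max for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: 4 controls (label 0) followed by 4 cases (label 1).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "packs_per_day": [0.1, 0.2, 0.0, 0.3, 1.0, 1.2, 0.9, 1.1],
    "packs_dup": [0.2, 0.4, 0.0, 0.6, 2.0, 2.4, 1.8, 2.3],  # near-duplicate
    "age": [30, 35, 32, 38, 50, 41, 55, 44],
    "noise": [5, 3, 4, 6, 4, 6, 3, 5],  # same mean in both groups
}
selected = filter_features(columns, labels)
```

In this sketch the near-duplicate feature is removed in stage II and the uninformative one in stage I. In a full pipeline the surviving features would then go through a LASSO fit and on to the classifier.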
Smoking is a major public health issue with a significant impact on premature death. In recent years, numerous decision support systems based on machine learning methods have been developed to support smoking cessation. However, the inevitable class imbalance is considered a major challenge in deploying such systems. In this paper, we present an empirical comparison of machine learning techniques for handling the class imbalance problem in the prediction of smoking cessation intervention outcomes among the Korean population. The objective is to improve prediction performance by using two synthetic oversampling techniques: the synthetic minority over-sampling technique (SMOTE) and adaptive synthetic sampling (ADASYN). The experimental design comprises three components. First, the best representative features are selected in two phases: the lasso method and multicollinearity analysis. Second, newly balanced data are generated using the SMOTE and ADASYN techniques. Third, machine learning classifiers are applied to construct prediction models for all subjects and for each gender. To assess the effectiveness of the prediction models, the f-score, type I error, type II error, balanced accuracy, and geometric mean are used. Comprehensive analysis demonstrates that the Gradient Boosting Trees (GBT), Random Forest (RF), and multilayer perceptron neural network (MLP) classifiers achieved the best performance for all subjects and for each gender when SMOTE and ADASYN were used. The SMOTE with GBT and RF models also provide feature importance scores that enhance the interpretability of the decision-support system. In addition, the results show that the presented synthetic oversampling techniques combined with machine learning models outperformed baseline models in smoking cessation prediction.
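The core idea of SMOTE mentioned above, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration, not the implementation used in the paper; the function name `smote_sketch`, the parameters `k` and `seed`, and the toy points are assumptions. ADASYN follows the same interpolation idea but generates more samples around minority points that have more majority-class neighbours, which is not shown here.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
```

Every synthetic point lies on a line segment between two real minority samples, so the oversampled class stays inside the region the minority data already occupies.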
Cigarette smoking is the leading cause of preventable death in the general population and is a significant topic in health research. The primary aim of this study is to determine the significant risk factors and to predict the outcome of a 6-month smoking cessation program among women in Korea. To this end, we examined a real-world dataset on a smoking cessation program restricted to women, collected from 2015 to 2017 by the Chungbuk Tobacco Control Center of Chungbuk National University College of Medicine in South Korea. We compared four machine learning techniques, Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Naïve Bayes (NB), for predicting whether a participant would quit smoking successfully or unsuccessfully. In total, we analyzed 60 features that may be associated with smoking cessation, such as socio-demographic characteristics and smoking status (age of starting, duration, and others), by employing a filter-based feature selection method. As a result, we identified 8 significant factors associated with smoking cessation. The experimental results demonstrate that NB performs better than the other classifiers. The performance of the prediction models was measured by accuracy, precision, recall, F-measure, and ROC area. These findings improve our understanding of the significant factors that contribute to the implementation of smoking cessation programs and to related public health concerns.
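The threshold-based measures listed above follow directly from the confusion matrix. The sketch below computes accuracy, precision, recall, and F-measure for a binary quit/no-quit prediction using the standard definitions; ROC area is omitted because it needs predicted scores rather than hard labels. The function name and the example labels are hypothetical and are not taken from the study's data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-measure from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical labels: 1 = successful quitter, 0 = unsuccessful quitter.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```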
Developing lifelong learning algorithms is essential for computational systems biology. Recently, many studies have shown how deep learning (DL) can extract biologically relevant information from high-dimensional data to help understand the complexity of cancer. Unfortunately, the number of known cancer types has grown into the hundreds, which makes it difficult for systems to classify them efficiently. Moreover, the current state-of-the-art continual learning (CL) methods are not designed for the dynamic characteristics of high-dimensional data, and data security and privacy are among the main concerns in the biomedical field. This paper addresses three practical challenges for class-incremental learning (Class-IL): data privacy, high dimensionality, and incremental learning. To address them, we propose a novel continual learning approach, called Deep Generative Feature Replay (DGFR), for cancer classification tasks. DGFR consists of an incremental feature selection (IFS) module and a scholar network (SN). IFS selects the most significant CpG sites from high-dimensional data. We investigate different dimensions to find an optimal number of selected CpG sites. SN employs a deep generative model to generate pseudo data without accessing past samples and a neural network classifier to predict cancer types. As the generative model we use a variational autoencoder (VAE), which has been applied successfully in this research field in previous works. All networks are trained sequentially on multiple tasks in the Class-IL setting. We evaluated the proposed method on publicly available DNA methylation data. The experimental results show that the proposed DGFR achieves significantly higher accuracy on cancer classification tasks than various state-of-the-art methods.
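The replay loop behind a scholar network can be illustrated with a deliberately simplified sketch. It is not DGFR itself: a per-class diagonal Gaussian stands in for the VAE, a nearest-centroid rule stands in for the neural network classifier, and the incremental feature selection step is left out. All function names, class labels, and data points are hypothetical. What the sketch does preserve is the key property described above: when a new task arrives, earlier classes are represented only by generated pseudo data, never by stored past samples.

```python
import math
import random
from statistics import mean, pstdev


def fit_generator(samples_by_class):
    """Per-class, per-feature (mean, std): a simple stand-in for the VAE."""
    return {
        label: [(mean(col), pstdev(col)) for col in zip(*rows)]
        for label, rows in samples_by_class.items()
    }


def sample(generator, label, n, rng):
    """Draw n pseudo samples for one previously learned class."""
    return [
        tuple(rng.gauss(mu, sd) for mu, sd in generator[label])
        for _ in range(n)
    ]


def fit_classifier(samples_by_class):
    """Nearest-centroid classifier: a stand-in for the neural network."""
    return {
        label: tuple(mean(col) for col in zip(*rows))
        for label, rows in samples_by_class.items()
    }


def predict(centroids, x):
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


def train_incrementally(tasks, n_replay=50, seed=0):
    """Train on tasks in sequence; old classes are seen only through replay."""
    rng = random.Random(seed)
    generator, centroids = {}, {}
    for new_data in tasks:
        # Pseudo data for every class learned so far (no stored real samples).
        replay = {
            label: sample(generator, label, n_replay, rng)
            for label in generator
        }
        combined = {**replay, **new_data}
        generator = fit_generator(combined)
        centroids = fit_classifier(combined)
    return generator, centroids


# Hypothetical two-feature data: task 1 introduces A and B, task 2 adds C.
tasks = [
    {"A": [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)],
     "B": [(5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]},
    {"C": [(10.0, 0.0), (10.1, 0.2), (9.9, 0.1)]},
]
generator, centroids = train_incrementally(tasks)
```

After the second task the classifier still separates the classes from the first task, even though their original samples were not available when it was retrained.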