DMP_MI: An Effective Diabetes Mellitus Classification Algorithm on Imbalanced Data With Missing Values

Wang, Qian; Cao, Weijia; Guo, J.; Ren, Jiadong; Cheng, Yongqiang; Davis, Darryl N.

doi:10.1109/access.2019.2929866

Cited by 84 publications

(41 citation statements)

References 21 publications

“…In such a technique, the new imputed value could be far from the central tendency of the population distribution. The performance in the pipeline (see Table 9) employed in [18], [20], [41], [42], [46] is less as comparing the proposed framework and others in [7], [44], [45]. Those fewer performances clearly indicate the role of outlier rejection and filling missing values in the PID dataset.…”

Section: E Results Comparison (mentioning)

confidence: 99%

Diabetes Prediction Using Ensembling of Different Machine Learning Classifiers

et al. 2020

Diabetes, a chronic illness, is a group of metabolic diseases caused by a high level of sugar in the blood over a long period. The risk factors and severity of diabetes can be reduced significantly if precise early prediction is possible. Robust and accurate prediction of diabetes is highly challenging due to the limited amount of labeled data and the presence of outliers (or missing values) in diabetes datasets. In this paper, we propose a robust framework for diabetes prediction that employs outlier rejection, filling of missing values, data standardization, feature selection, K-fold cross-validation, different Machine Learning (ML) classifiers (k-nearest Neighbour, Decision Trees, Random Forest, AdaBoost, Naive Bayes, and XGBoost), and a Multilayer Perceptron (MLP). A weighted ensembling of different ML models is also proposed to improve the prediction of diabetes, where the weights are estimated from the corresponding Area Under ROC Curve (AUC) of each ML model. AUC is chosen as the performance metric, which is then maximized during hyperparameter tuning using the grid search technique. All experiments were conducted under the same experimental conditions using the Pima Indian Diabetes Dataset. Across all experiments, our proposed ensembling classifier is the best performing classifier, with sensitivity, specificity, false omission rate, diagnostic odds ratio, and AUC of 0.789, 0.934, 0.092, 66.234, and 0.950 respectively, which outperforms the state-of-the-art results by 2.00 % in AUC. Our proposed framework for diabetes prediction outperforms the other methods discussed in the article. It can also provide better results on the same dataset, which can lead to better performance in diabetes prediction. Our source code for diabetes prediction is made publicly available.
INDEX TERMS Diabetes prediction, ensembling classifier, machine learning, multilayer perceptron, missing values and outliers, Pima Indian Diabetic dataset.
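
The abstract above says the ensemble weights are estimated from each model's AUC but does not give the formula. The following is a minimal sketch of one plausible reading, assuming each weight is the model's AUC divided by the sum of all AUCs. The function names, the toy labels, and the toy probabilities are hypothetical and are not taken from the cited paper.

```python
def auc(labels, scores):
    """Rank-based AUC: chance that a random positive scores above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_weighted_ensemble(labels, model_scores):
    """Blend per-model probabilities with weights proportional to each model's AUC.

    Assumption: weight_i = AUC_i / sum(AUC); the source does not state the formula.
    """
    aucs = [auc(labels, scores) for scores in model_scores]
    total = sum(aucs)
    weights = [a / total for a in aucs]
    blended = [
        sum(w * scores[i] for w, scores in zip(weights, model_scores))
        for i in range(len(labels))
    ]
    return weights, blended


# Toy validation labels and predicted probabilities from three hypothetical models.
y_val = [0, 0, 1, 1, 0, 1]
probs = [
    [0.1, 0.4, 0.8, 0.7, 0.3, 0.9],  # model A: separates the classes perfectly
    [0.2, 0.6, 0.5, 0.9, 0.1, 0.6],  # model B: mostly right, with some overlap
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # model C: uninformative
]
weights, blended = auc_weighted_ensemble(y_val, probs)
```

In practice the AUC values used for weighting would be measured on held-out validation folds, so that the weights are not tuned on the data used for the final evaluation.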

“…In addition, a study used the SMOTE with several ML algorithms to predict acute myocardial infarction < 1 month and all-cause mortality < 1 month for MACE in emergency department patients with chest pain [23]. Furthermore, in [24] the authors proposed a prediction algorithm for diabetes mellitus classification on imbalanced data using the adaptive synthetic (ADASYN) sampling technique to reduce the influence of class imbalance, then using a RF classifier to generate predictions models.…”

Section: B Imbalanced Data Solution In Medical Domain (mentioning)

confidence: 99%

A Stacking Ensemble Prediction Model for the Occurrences of Major Adverse Cardiovascular Events in Patients With Acute Coronary Syndrome on Imbalanced Data

2021

The major adverse cardiovascular events (MACE) often occur with high morbidity and mortality globally. It is very important to predict the MACE occurrences accurately in patients with acute coronary syndrome (ACS). Therefore, this paper proposes a stacking ensemble model for the prediction of MACE occurrences in patients with ACS at early stage. Our research contents are given as follows. First, we use Korea Acute Myocardial Infarction Registry National Institutes of Health (KAMIR-NIH) dataset and experimental data are extracted from the raw data and preprocessed. Second, we apply three data sampling approaches, such as borderline synthetic minority oversampling technique (Borderline-SMOTE1), cluster centroids undersampling, and synthetic minority oversampling techniques (SMOTE) plus Tomek Links (SMOTETomek) hybrid technique, to solve the class imbalance problem. Third, to develop a stacking ensemble prediction model for the occurrences of MACE, we apply seven widely used machine learning algorithms, such as logistic regression (LR), support vector machine (SVM), K-Nearest Neighbors (KNN), decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), and adaptive boosting (AdaBoost), as base learners. Fourth, the performance of proposed stacking ensemble model is compared with the seven base learners using the three data sampling techniques. In the result, the proposed stacking ensemble model with the SMOTETomek showed the best performance with 0.9862 accuracy, 0.9976 precision, 0.975 recall, 0.9862 f1_score, 0.9863 g-mean, and 0.9863 AUC and provided a better solution for imbalanced dataset. 
Consequently, our finding was that the proposed stacking ensemble model with the SMOTETomek outperformed the base learners and improved the accuracy of diagnosis and prediction of the MACE occurrences in patients with ACS at early stage.
INDEX TERMS Major adverse cardiovascular events (MACE), acute coronary syndrome (ACS), stacking ensemble classifier, machine learning, data sampling, imbalanced data.

“…However, only a few studies discussed about preprocessing on Pima Indian dataset. The problem of missing value is discussed in a limited number of papers [8,13,14,15,17]. The problem of imbalanced data [10,11,17] and of feature selection [5,9,10,14] have been discussed too.…”

Section: Literature Review (mentioning)

confidence: 99%

“…There are several studies that discussed diabetes diagnosis prediction based on data. Besides Pima Indian dataset [3][4][5][6][7][8][9][10][11][12][13][14][15][16][17], there is also data from Luzhou [4], Irvine [18], Kashmir [19,20], online questionnaire [21], and dr. Schorling [9,21]. There are various classification methods on diabetes diagnosis prediction like random forest, J48, naïve bayes (NB), support vector machine (SVM), logistic regression, neural network (NN), and K-Nearest Neighbors.…”

Section: Introduction (mentioning)

confidence: 99%

Preprocessing Handling to Enhance Detection of Type 2 Diabetes Mellitus based on Random Forest

Ramadhan, Adiwijaya, Romadhony

2021

IJACSA

Diabetes is a non-communicable disease that has a death rate of 70% in the world. The majority of diabetes cases, 90-95%, are type 2 diabetes, which is caused by an unhealthy lifestyle. Type 2 diabetes can be detected earlier by using an examination that contains diabetes-related parameters. However, the dataset does not always contain complete information, the distribution between positive and negative classes is mostly imbalanced, and some parameters have low importance to the decision class. To overcome these problems, this study carries out preprocessing to improve detection precision and recall. In this paper, we propose an approach to dataset preprocessing, which is applied to diabetes prediction. The preprocessing approach consists of the following processes: missing value handling, imbalanced data handling, feature importance, and data augmentation. The preprocessing uses the median for missing values, random oversampling for imbalanced data, the Gini score in the random forest for feature importance, and the posterior distribution for data augmentation. This research used random forest and logistic regression as classification algorithms. The experimental results show that classification precision increased by 20% and recall by 24% when the proposed method was applied with random forest, compared with random forest without the proposed method.
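
The first two preprocessing steps named in this abstract, median imputation and random oversampling, can be sketched in a few lines. This is an illustrative sketch only, not the cited authors' code: missing values are assumed to be encoded as `None`, and the function names and the toy rows are hypothetical.

```python
import random
from statistics import median


def impute_median(rows):
    """Replace None entries in each column with the median of that column's observed values."""
    n_cols = len(rows[0])
    medians = [
        median(r[j] for r in rows if r[j] is not None) for j in range(n_cols)
    ]
    return [[medians[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([y] * target)
    return out_rows, out_labels


# Hypothetical two-feature rows with missing entries; labels are 1 (positive) / 0 (negative).
X = [[148, 33.6], [85, None], [None, 23.3], [89, 28.1], [137, 43.1]]
y = [1, 0, 0, 0, 1]

X_filled = impute_median(X)
X_bal, y_bal = random_oversample(X_filled, y)
```

When this kind of preprocessing is combined with cross-validation, the oversampling is normally applied to the training folds only, so that duplicated rows do not leak into the evaluation data.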
