Missing data imputation with fuzzy feature selection for diabetes dataset

Dzulkalnine, Mohamad Faiz; Sallehuddin, Roselina

doi:10.1007/s42452-019-0383-x

Cited by 40 publications

(10 citation statements)

References 27 publications

“…This means that the reduction of Core and Reduct dimensions increases the results of Fuzzy C-Means clustering. This applies to all distance functions. A few of the aforementioned results are in line with previous research [25][26]. R. Zhao, L. Gu, and X. Zhu also did research in the same field as this research.…”

supporting

confidence: 79%

Distance Functions Study in Fuzzy C-Means Core and Reduct Clustering

Eliyanto

Surono

2021

Jurnal Ilmiah Teknik Elektro Komputer Dan Informatika

Fuzzy clustering aims to produce clusters that take into account the possible membership of each data point in a particular cluster. Fuzzy C-Means Clustering Core and Reduct is a Fuzzy C-Means Clustering method that has been optimized by reducing the Core and Reduct dimensions. The method depends heavily on the distance function used. As a further in-depth study, this work examines the performance of Fuzzy C-Means Clustering Core and Reduct under various distance functions. We aim to see how consistent the results of this method are across distance functions and to find the best distance function. Seven distance functions are applied to the same dataset: the Euclidean, Manhattan, Minkowski, Chebyshev, Minkowski-Chebyshev, Canberra, and Averages distances. We use UCI Machine Learning datasets for this research. The quality of the clustering results is compared through several measures. Accuracy, Silhouette score, and Davies Bouldin Index are used as internal measurements. Fuzzy C-Means Core and Reduct clustering significantly decreased the computational load for all distance functions. Accuracy and purity values can be maintained above 80%. The Silhouette Coefficient Score increased and the Davies Bouldin Index decreased after dimension reduction was applied. This means the quality of the clustering results can be maintained. The distance with the best evaluation result is the Euclidean distance. The method runs consistently across all tested distance functions.
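As a rough illustration of how some of the distance functions named in this abstract differ, the sketch below implements the Manhattan, Euclidean, Minkowski, Chebyshev, and Canberra distances in plain Python. The function names and the two sample points are assumptions made for this example and do not come from the cited paper. The Minkowski-Chebyshev and Averages distances are left out because the abstract does not define them.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def chebyshev(x, y):
    """Chebyshev distance: the largest coordinate-wise absolute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    """Canberra distance: sum of |a - b| / (|a| + |b|), skipping 0/0 terms."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


# Two made-up points, used only to compare the measures side by side.
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

distances = {
    "manhattan": minkowski(x, y, 1),
    "euclidean": minkowski(x, y, 2),
    "chebyshev": chebyshev(x, y),
    "canberra": canberra(x, y),
}

for name, value in distances.items():
    print(f"{name}: {value:.3f}")
```

In a Fuzzy C-Means loop, any of these functions could stand in for the point-to-centroid distance, which is presumably where the choice of measure affects the resulting memberships.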

“…Accuracy in high dimensional setting, generalization of the approach (Leke et al, 2017) Deep belief network Performs well for larger missing ratio Deep neural network High approximation power Generative adversarial nets Effectively recover the data with a few parameters of the input data (Qu et al, 2018) Long-short-term memory + support vector regression Performs well for time series block missing pattern with a high missing ratio (Li et al, 2019) Swarm intelligence Impute missing data in a high-dimensional data set (Leke and Marwala, 2016) Transfer learning Use evolutionary searches and neural networks applied in the context of transfer learning (Gupta et al, 2019) Dimensionality reduction Principal component analysis (PCA) Better classification accuracy and faster computational time (Dzulkalnine and Sallehuddin, 2019) Suitable for high level of missingness (Lai and Kuok, 2019) k-nearest neighbors (kNN) Objective, data-driven and generic, and they can be easily applied for estimating missing precipitation (Pan et al, 2015) Accounts for MNAR (Jiang and Yang, 2015) Addresses the correlation between attributes (Lee and Styczynski, 2018) Attention to feature relevance (Liu et al, 2020) Focused on important features dealing with missing observations (Daberdaku et al, 2020) Improved performance on large data sets, cost effective, computation efficient and accurate (Keerin et al, 2012) Imputes missing data regardless of missing intervals (Teegavarapu, 2014) Local data clustering being incorporated for improved quality and efficiency (Kim et al, 2017) Missing data imputation of longitudinal clinical data (Sanjar et al, 2020) Application of...…”

Section: Cuckoo Search

mentioning

confidence: 99%

“…Deep learning-cuckoo search (DL-CS) imputation technique exhibited 87% accuracy with high-dimensional data sets and outperformed other similar deep learning imputation methods (Gupta et al, 2019). Fuzzy c-means imputation using significant features produced much lower RMSE value of 0.049 compared to 4.930 obtained with grey fuzzy neural network (GFNN) on the experimentation data set (Dzulkalnine and Sallehuddin, 2019). Even at 60% missing rate, the semi-supervised RF imputation method showed an accuracy of 87% (Ishioka, 2013).…”

Section: Rq3: Evaluation Of Imputation

mentioning

confidence: 99%

A systematic review of machine learning-based missing value imputation techniques

Thomas

Rajabi

2021

DTA

Purpose: The primary aim of this study is to review the studies from different dimensions including type of methods, experimentation setup and evaluation metrics used in the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding about how well the proposed framework is evaluated and what type and ratio of missingness are addressed in the proposals. The review questions in this study are (1) what are the ML-based imputation methods studied and proposed during 2010–2020? (2) How the experimentation setup, characteristics of data sets and missingness are employed in these studies? (3) What metrics were used for the evaluation of imputation method?

Design/methodology/approach: The review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers totaling at 2,883. Most of the papers at this stage were not exactly an MVI technique relevant to this study. The literature reviews are first scanned in the title for relevancy, and 306 literature reviews were identified as appropriate. Upon reviewing the abstract text, 151 literature reviews that are not eligible for this study are dropped. This resulted in 155 research papers suitable for full-text review. From this, 117 papers are used in assessment of the review questions.

Findings: This study shows that clustering- and instance-based algorithms are the most proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are most used evaluation metrics in these studies. For experimentation, majority of the studies sourced the data sets from publicly available data set repositories. A common approach is that the complete data set is set as baseline to evaluate the effectiveness of imputation on the test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experimentations, while missing datatype and mechanism are pertaining to the capability of imputation. Computational expense is a concern, and experimentation using large data sets appears to be a challenge.

Originality/value: It is understood from the review that there is no single universal solution to missing data problem. Variants of ML approaches work well with the missingness based on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputations based on k-nearest neighbors (kNN) and clustering algorithms which are simple and easy to implement make it popular across various domains.
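The evaluation protocol described in this abstract (a complete data set as baseline, artificially induced missingness, RMSE as the metric) can be sketched roughly as follows. This is a minimal illustration, not the procedure of any reviewed study: the helper names, the toy values, the 20% missing ratio, and the use of mean imputation as the method under test are all assumptions chosen to keep the example short.

```python
import math
import random


def induce_missing(values, ratio, seed=0):
    """Hide a random fraction of entries (as None) and return their indices."""
    rng = random.Random(seed)
    n_missing = int(round(ratio * len(values)))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in missing_idx else v for i, v in enumerate(values)]
    return masked, sorted(missing_idx)


def mean_impute(masked):
    """Fill every None with the mean of the observed entries."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def rmse(truth, imputed, idx):
    """Root mean square error over the positions that were hidden."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in idx) / len(idx))


# Toy complete column (made-up numbers, not taken from any data set).
complete = [85.0, 89.0, 78.0, 137.0, 116.0, 110.0, 183.0, 197.0, 125.0, 148.0]

masked, missing_idx = induce_missing(complete, ratio=0.2, seed=0)
imputed = mean_impute(masked)
score = rmse(complete, imputed, missing_idx)
print(f"hidden positions: {missing_idx}, RMSE: {score:.3f}")
```

Because the error is computed only on the hidden positions, the score reflects how well the imputer recovers values it never saw; swapping `mean_impute` for a kNN or clustering-based imputer would leave the rest of the harness unchanged.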

“…Biessmann et al [14] uses the deep learning model for truthful imputation of non-numeric values. Dzulkalnine et al [15] implement the feature selection hybrid model to impute the missing data by the integration of the Fuzzy Principal Component Analysis (FPCA), support vector machine, and the Fuzzy c-means (FCM) to select the relevant features only in the missed data treatment process. Sherif et al [16] offered a new approach using clustering, the local least square imputation method, then select the smallest Euclidean distance to catch the missed data value from a similar cluster to the missed value.…”

Section: Missing Data Handling

mentioning

confidence: 99%

Diabetes classification application with efficient missing and outliers data handling algorithms

Torkey

Ibrahim

Hemdan

et al. 2021

Complex Intell. Syst.

Communication between sensors spread throughout healthcare systems may cause some of the transferred features to go missing. Repairing the data problems of sensing devices with artificial intelligence technologies has facilitated the Medical Internet of Things (MIoT) and its emerging applications in healthcare. MIoT has great potential to affect patients' lives. The volume of data collected from smart wearable devices increases dramatically as data is gathered from millions of patients suffering from diseases such as diabetes. However, sensor or human errors lead to some values being missing. The major challenge is how to predict these values so that the performance of the data analysis model stays within a good range. In this paper, a complete healthcare system for diabetics is used, and two new algorithms are developed to handle the crucial problem of missing data from MIoT wearable sensors. The proposed work is based on the integration of Random Forest, mean, class mean, interquartile range (IQR), and Deep Learning to produce a clean and complete dataset, which can enhance the performance of any machine learning model. Moreover, an outlier repair technique is proposed that is based on dataset class detection followed by repair with Deep Learning (DL). The final model accuracy with the two steps of imputation and outlier repair is 97.41%, with an Area Under Curve (AUC) of 99.71%. The healthcare system used is a web-based diabetes classification application, built with Flask, for use in hospitals and healthcare centers to diagnose patients effectively.
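Two of the building blocks named in this abstract, class-mean imputation and the interquartile range (IQR), can be illustrated with a short sketch. It is not the authors' algorithm, which also involves Random Forest and Deep Learning: the function names, the conventional 1.5 × IQR fence, and the toy readings are assumptions made for this example.

```python
import statistics


def class_mean_impute(values, labels):
    """Fill each None with the mean of observed values sharing its class label.

    Assumes every class has at least one observed value.
    """
    sums, counts = {}, {}
    for v, c in zip(values, labels):
        if v is not None:
            sums[c] = sums.get(c, 0.0) + v
            counts[c] = counts.get(c, 0) + 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] if v is None else v for v, c in zip(values, labels)]


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Made-up readings with two class labels (0 and 1) and two gaps.
readings = [90.0, None, 110.0, 150.0, None, 170.0]
labels = [0, 0, 0, 1, 1, 1]
filled = class_mean_impute(readings, labels)

# A separate made-up series with one obvious outlier at the end.
series = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 100.0]
flagged = iqr_outliers(series)

print(filled)
print(flagged)
```

Imputing per class rather than with a global mean keeps filled values closer to what is typical for each group, at the cost of needing the class label at imputation time.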
