Improved KNN Imputation for Missing Values in Gene Expression Data

Keerin, Phimmarin; Boongoen, Tossapon

doi:10.32604/cmc.2022.020261

Cited by 16 publications

(8 citation statements)

References 49 publications

“…An insurance dataset was obtained and input into R-studio with variables (Third-party, comprehensive, marine, and Fire + stolen) clearly defined. Running a series of pattern extraction and analytical functions in R-studio detected missing values from the dataset with 89.5% classification accuracy as summarized in Figure 13; The findings from this study are consistent with the results from similar studies [28], [29] … performance in the experimental replacement of numerical values due to its unique ability to classify the missing parameters and assign cluster ratios for each type unlike other techniques that perform replacement in whole datasets based on the normalized computation of mean absolute errors and root mean square error. A study [31] observes that imputation based on computational and statistical models is recommended by scientists due to its unique ability to determine the missing values by averaging a summarized likelihood function of the entire dataset over a mathematically defined predictive distribution with considerably high precision.…”

Section: Results (supporting)

confidence: 88%

The Effect of Using Data Pre-Processing by Imputations in Handling Missing Values

Karrar

2022

IJEEI

The evolution of big data analytics through machine learning and artificial intelligence techniques has caused organizations in a wide range of sectors including health, manufacturing, e-commerce, governance, and social welfare to realize the value of massive volumes of data accumulating on web-based repositories daily. This has led to the adoption of data-driven decision models; for example, through sentiment analysis in marketing, where producers leverage customer feedback and reviews to develop customer-oriented products. However, the data generated in real-world activities is subject to errors resulting from inaccurate measurements or faulty input devices, which may result in the loss of some values. Missing attribute/variable values make data unsuitable for decision analytics due to noise and inconsistencies that create bias. The objective of this paper was to explore the problem of missing data and develop an advanced imputation model based on machine learning, implemented with the K-Nearest Neighbor (KNN) algorithm in the R programming language, as an approach to handling missing values. The methodology relied on applying advanced machine learning algorithms, which offer high accuracy in pattern detection and predictive analytics, to the existing imputation techniques, which handle missing values by random replacement or deletion. According to the results, the advanced imputation technique based on machine learning models replaced missing values in a dataset with 89.5% accuracy. The experimental results showed that pre-processing by imputation delivers high-level performance efficiency in handling missing data values. These findings are consistent with the key idea of the paper, which is to explore alternative imputation techniques for handling missing values to improve the accuracy and reliability of decision insights extracted from datasets.

“…However, an inter-class overlap may dampen the quality of this local approach, as compared to the clustering-oriented technique such as SingleClus. The same problem is also witnessed for the task of imputing missing values, where clustering information can be exploited to improve the accuracy of estimates of those missing ones [37,38]. Nonetheless, the use of a single clustering seen with SingleClus may overlook patterns exhibited in data under examination.…”

Section: Results (mentioning)

confidence: 96%

Strengthening intrusion detection system for adversarial attacks: improved handling of imbalance classification problem

Pimsarn

Boongoen

Iam-On

et al. 2022

Complex Intell. Syst.

Self Cite

Most defence mechanisms such as a network-based intrusion detection system (NIDS) are often sub-optimal for the detection of an unseen malicious pattern. In response, a number of studies attempt to empower a machine-learning-based NIDS to improve the ability to recognize adversarial attacks. Along this line of research, the present work focuses on non-payload connections at the TCP stack level, which is generalized and applicable to different network applications. As a complement to the recently published investigation that searches for the most informative feature space for classifying obfuscated connections, the problem of class imbalance is examined herein. In particular, a multiple-clustering-based undersampling framework is proposed to determine the set of cluster centroids that best represent the majority class, whose size is reduced to be on par with that of the minority. Initially, a pool of centroids is created using the concept of ensemble clustering that aims to obtain a collection of accurate and diverse clusterings. From that, the final set of representatives is selected from this pool. Three different objective functions are formed for this optimization-driven process, thus leading to three variants of FF-Majority, FF-Minority and FF-Overall. Based on the thorough evaluation of a published dataset, four classification models and different settings, these new methods often exhibit better predictive performance than their baseline, the single-clustering undersampling counterpart and state-of-the-art techniques. Parameter analysis and implication for analyzing an extreme case are also provided as a guideline for future applications.
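To make the idea of centroid-based undersampling in the abstract above more concrete, here is a minimal Python sketch of the simpler single-clustering baseline only: the majority class is replaced by k-means centroids, with k set to the minority class size. It is not the ensemble-based FF-Majority, FF-Minority or FF-Overall method. The function names `kmeans_centroids` and `undersample_majority`, the evenly spaced initialisation and the fixed iteration count are all assumptions made for illustration.

```python
def kmeans_centroids(points, k, iterations=20):
    """Plain k-means over lists of numbers; returns k centroids."""
    # Assumption: deterministic start from evenly spaced input points.
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the centroid with the smallest squared distance.
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def undersample_majority(majority, minority):
    """Shrink the majority class to the minority size using cluster centroids."""
    return kmeans_centroids(majority, k=len(minority))
```

Calling `undersample_majority(majority, minority)` returns as many synthetic representatives as there are minority samples, so a classifier can then be trained on a balanced set.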

“…The missing values are only computed from the instance subset that is a high correlation with the sample that contains the missing values [15]. The k-nearest neighbor imputation (KNNimpute) and local least square imputation (LLSimpute) are widely used existing imputation methods are among this approach category [8], [19][20]. For The KNNimpute, this method has performed k-nearest neighbor algorithms depending on k number of high sample correlation with gene contained missing value to compute missing data in the dataset.…”

Section: Table I Missing Data Imputation Algorithms Categorized Into ... (mentioning)

confidence: 99%
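The neighbour-based imputation described in the statement above can be sketched roughly as follows. This is a simplified illustration, not the KNNimpute implementation or the improved method of the cited paper: it assumes missing entries are marked with `None`, measures similarity with a root-mean-square distance over the columns both rows have observed (rather than correlation), and fills each gap with the unweighted mean of the k nearest rows. The function name `knn_impute` is hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries in equal-length rows using the k nearest rows."""

    def distance(a, b):
        # Root-mean-square difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # A neighbour is only usable if it has column j observed.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates = [c for c in candidates if math.isfinite(c[0])]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:  # leave the gap in place if no neighbour qualifies
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


# Rows stand for genes, columns for samples (toy values, not real data).
expression = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 100.0],
]
print(knn_impute(expression, k=2)[0])  # [1.0, 2.0, 4.0]
```

The two closest rows to the incomplete one supply the values 3.0 and 5.0, so the gap is filled with their mean; the distant fourth row is ignored.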

“…However, this method is not a suitable solution for some datasets which consist of many incomplete values. Although the machine learning algorithms were exploited in numerous estimation applications for the time ahead prediction and objective classification, various up-to-date imputation methods were also proposed to handle this problem effectively via using convenient machine learning algorithms such as the regression method [8], the k-nearest neighbor method [9], deep learning approach [10][11], the neural network-based method [12] with advanced statistics strategies [13], [14]. The most appropriate value estimation predicted by these imputations used incompatible algorithms.…”

Section: Introduction (mentioning)

confidence: 99%

An Intelligent Missing Data Imputation Techniques: A Review

Park

Kang

Lee

2022

JOIV : Int. J. Inform. Visualization

Incomplete datasets are an inescapable problem in data preprocessing, because most machine learning algorithms cannot train a model on them. Various data imputation approaches have been proposed to resolve this problem. These methods predict the most appropriate value using different machine learning algorithms built on various concepts. Furthermore, accurate estimation is exceptionally critical when completing missing values in some datasets, especially medical data. The purpose of this paper is to compare distinguished state-of-the-art benchmarks: the K-nearest Neighbors Imputation (KNNImputer) method, the Bayesian Principal Component Analysis (BPCA) imputation method, the Multiple Imputation by Chained Equations (MICE) method, and the Multiple Imputation with denoising autoencoder neural network (MIDAS) method. Each of these methods offers a workable way to estimate appropriate data points for imputing missing values. We run the experiment with all these imputation techniques on the same four datasets, which were collected from a hospital. Both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are used to measure the outcomes and compare the methods, in order to identify a robust and appropriate method for missing data problems. In the experiment, KNNImputer and MICE performed better than BPCA and MIDAS, and BPCA performed better than MIDAS.
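The two error measures named in the abstract above, MAE and RMSE, can be computed as in the following sketch. It assumes the usual evaluation setup in which known values are hidden, imputed, and then compared with the originals; the function name `mae_rmse` and the toy numbers are illustrative and not taken from the paper.

```python
import math


def mae_rmse(true_values, imputed_values):
    """Return (MAE, RMSE) between hidden true values and their imputed estimates."""
    errors = [t - p for t, p in zip(true_values, imputed_values)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Toy example: three exact estimates and one that is off by 4.
print(mae_rmse([2.0, 4.0, 6.0, 8.0], [2.0, 4.0, 6.0, 12.0]))  # (1.0, 2.0)
```

Because RMSE squares each error before averaging, the single large miss doubles it relative to MAE here, which is why the two measures are often reported together.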
