A systematic review of machine learning-based missing value imputation techniques

Thomas, Tressy; Rajabi, Enayat

doi:10.1108/dta-12-2020-0298

Cited by 42 publications

(27 citation statements)

References 130 publications


“…Missing data is an inevitable and challenging issue in our retrospective study, which may lead to a biased conclusion if handled inappropriately. The K-nearest neighbors rule is an effective algorithm to impute missing data (34), although it should not be the fundamental solution. The reasons for missing data are probably because (1) the clinical significance of a series of laboratory indicators was not evidenced sufficiently as the biomarkers to predict adverse outcomes of preeclampsia.…”

Section: Discussion (mentioning)

confidence: 99%

Comparison of machine learning and logistic regression as predictive models for adverse maternal and neonatal outcomes of preeclampsia: A retrospective study

Zheng, Hao, Khan et al. 2022

Front. Cardiovasc. Med.


Introduction: Preeclampsia, one of the leading causes of maternal and fetal morbidity and mortality, demands accurate predictive models for the lack of effective treatment. Predictive models based on machine learning algorithms demonstrate promising potential, while there is a controversial discussion about whether machine learning methods should be recommended preferably, compared to traditional statistical models.

Methods: We employed both logistic regression and six machine learning methods as binary predictive models for a dataset containing 733 women diagnosed with preeclampsia. Participants were grouped by four different pregnancy outcomes. After the imputation of missing values, statistical description and comparison were conducted preliminarily to explore the characteristics of documented 73 variables. Sequentially, correlation analysis and feature selection were performed as preprocessing steps to filter contributing variables for developing models. The models were evaluated by multiple criteria.

Results: We first figured out that the influential variables screened by preprocessing steps did not overlap with those determined by statistical differences. Secondly, the most accurate imputation method is K-Nearest Neighbor, and the imputation process did not affect the performance of the developed models much. Finally, the performance of models was investigated. The random forest classifier, multi-layer perceptron, and support vector machine demonstrated better discriminative power for prediction evaluated by the area under the receiver operating characteristic curve, while the decision tree classifier, random forest, and logistic regression yielded better calibration ability verified, as by the calibration curve.

Conclusion: Machine learning algorithms can accomplish prediction modeling and demonstrate superior discrimination, while Logistic Regression can be calibrated well. Statistical analysis and machine learning are two scientific domains sharing similar themes. The predictive abilities of such developed models vary according to the characteristics of datasets, which still need larger sample sizes and more influential predictors to accumulate evidence.


“…MVI methods come in many flavors and can be classified into four categories: naïve imputation, feature-based imputation, global-based imputation and ensemble imputation [29,46] (See Supplementary…”

Section: Imputation Methods (mentioning)

confidence: 99%

Dealing with missing values in proteomics data

et al. 2022


Proteomics data are often plagued with missingness issues. These missing values (MVs) threaten the integrity of subsequent statistical analyses by reduction of statistical power, introduction of bias, and failure to represent the true sample. Over the years, several categories of missing value imputation (MVI) methods have been developed and adapted for proteomics data. These MVI methods perform their tasks based on different prior assumptions (e.g., data is normally or independently distributed) and operating principles (e.g., the algorithm is built to address random missingness only), resulting in varying levels of performance even when dealing with the same dataset. Thus, to achieve a satisfactory outcome, a suitable MVI method must be selected. To guide decision making on suitable MVI method, we provide a decision chart which facilitates strategic considerations on datasets presenting different characteristics. We also bring attention to other issues that can impact proper MVI such as the presence of confounders (e.g., batch effects) which can influence MVI performance. Thus, these too, should be considered during or before MVI.


“…There are several practices to deal and address missing data, and techniques of imputation missing values can be discovered. One of the practices that this paper attempts to discuss is an imputation techniques through machine learning algorithms [13]- [15]. A proper method of imputing can help to improve the quality of datasets for analyzing better healthcare decision.…”

Section: Introduction (mentioning)

confidence: 99%

Systematic Review on Missing Data Imputation Techniques with Machine Learning Algorithms for Healthcare

Ismail, Abidin, Maen 2022

Journal of Robotics and Control


Missing data is one of the most common issues encountered in data cleaning process especially when dealing with medical dataset. A real collected dataset is prone to be incomplete, inconsistent, noisy and redundant due to potential reasons such as human errors, instrumental failures, and adverse death. Therefore, to accurately deal with incomplete data, a sophisticated algorithm is proposed to impute those missing values. Many machine learning algorithms have been applied to impute missing data with plausible values. However, among all machine learning imputation algorithms, KNN algorithm has been widely adopted as an imputation for missing data due to its robustness and simplicity and it is also a promising method to outperform other machine learning methods. This paper provides a comprehensive review of different imputation techniques used to replace the missing data. The goal of the review paper is to bring specific attention to potential improvements to existing methods and provide readers with a better grasps of imputation technique trends.
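Several of the citing works above single out KNN as an imputation method. As a rough illustration only, and not the procedure used in any of the cited papers, the sketch below fills each missing entry with the mean of that column over the k nearest rows. The function name `knn_impute`, the toy data, the choice of k = 2, and the distance (root mean squared difference over columns observed in both rows) are all assumptions made for this example.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of rows where each None is replaced by the mean of that
    column over the k nearest rows that have the column observed.

    Hypothetical helper for illustration; not taken from the cited papers.
    """

    def distance(a, b):
        # Root mean squared difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no common columns, so the rows are not comparable
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]  # leave the input untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors: other rows where column j is observed and a distance exists.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # stays None when no usable donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy data (made up): the second row is missing its middle value.
data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [0.9, 2.2, 2.9],
    [8.0, 9.0, 10.0],
]
filled = knn_impute(data, k=2)
```

Here the two rows closest to the incomplete row are the first and third, so the gap is filled from their middle values and the distant fourth row is ignored. In practice, features are usually scaled before computing distances, and a library implementation would normally be preferred over a hand-written loop.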
