A nonparametric multiple imputation approach for missing categorical data

Zhou, Muhan; He, Yulei; Yu, Mandi; Hsu, Chiu Hsieh

doi:10.1186/s12874-017-0360-2

Cited by 6 publications

(5 citation statements)

References 23 publications

“…KNN and RF have been reported to have excellent imputation performance in relevant studies [5][6][7][8][9][10][27], but these researches were based on real-data applications with continuous or mixed variables in limited application scenarios. In this study, KNN only had a relatively moderate imputation accuracy in scenarios with a value distribution of 7:3, but it had poor performance in most scenarios, which may be caused by the local structure of data in these simulation scenarios.…”

Section: Discussion (mentioning)

confidence: 99%

“…Many researchers have successively used different datasets to compare the performance of traditional statistical and machine learning imputation methods, but the conclusions were different. Wei et al [5], Waljee et al [6], Shah et al [7] demonstrated respectively that RF outperforms other imputation methods in their datasets; Jerez et al [8], Zhou et al [9], Jadhav et al [10] found KNN outperforms other imputation methods in their datasets; Chlioui et al [11] found SVM performs best in two numeric datasets, while Tsai [12] found DT performs best in mixed datasets. Furthermore, both the ensemble learning (EL) algorithm proposed by Wang [13] and the generative adversarial imputation nets (GAIN) algorithm proposed by Dong [14] have been reported as possessing satisfactory imputation performance.…”

Section: Introduction (mentioning)

confidence: 99%

A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods

Zhang

2023

Sci Rep

The problem of missing data, particularly for dichotomous variables, is a common issue in medical research. However, few studies have focused on the imputation methods of dichotomous data and their performance, as well as the applicability of these imputation methods and the factors that may affect their performance. In the arrangement of application scenarios, different missing mechanisms, sample sizes, missing rates, the correlation between variables, value distributions, and the number of missing variables were considered. We used data simulation techniques to establish a variety of different compound scenarios for missing dichotomous variables and conducted real-data validation on two real-world medical datasets. We comprehensively compared the performance of eight imputation methods (mode, logistic regression (LogReg), multiple imputation (MI), decision tree (DT), random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), and artificial neural network (ANN)) in each scenario. Accuracy and mean absolute error (MAE) were applied to evaluating their performance. The results showed that missing mechanisms, value distributions and the correlation between variables were the main factors affecting the performance of imputation methods. Machine learning-based methods, especially SVM, ANN, and DT, achieved relatively high accuracy with stable performance and were of potential applicability. Researchers should explore the correlation between variables and their distribution pattern in advance and prioritize machine learning-based methods for practical applications when encountering dichotomous missing data.
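The evaluation described in the abstract above (impute a dichotomous variable, then score the imputed values with accuracy and MAE) can be illustrated with a minimal sketch. It covers only the simplest of the eight methods, mode imputation, under a missing-completely-at-random mechanism. The function names, seeds, sample size, and missing rate are assumptions chosen for illustration and are not taken from the study.

```python
import random


def simulate_dichotomous(n, p_one, seed):
    """Draw n values in {0, 1}; p_one=0.7 gives a 7:3 value distribution."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_one else 0 for _ in range(n)]


def mask_mcar(values, missing_rate, seed):
    """Replace each value with None independently with probability missing_rate."""
    rng = random.Random(seed)
    return [None if rng.random() < missing_rate else v for v in values]


def mode_impute(observed):
    """Fill every missing entry with the most frequent observed value (ties go to 1)."""
    present = [v for v in observed if v is not None]
    if not present:
        raise ValueError("no observed values to take the mode from")
    mode = 1 if 2 * sum(present) >= len(present) else 0
    return [mode if v is None else v for v in observed]


def accuracy_and_mae(truth, observed, imputed):
    """Score the imputed values only at positions that were missing."""
    missing = [i for i, v in enumerate(observed) if v is None]
    if not missing:
        raise ValueError("nothing was missing, so there is nothing to score")
    hits = sum(1 for i in missing if truth[i] == imputed[i])
    abs_err = sum(abs(truth[i] - imputed[i]) for i in missing)
    return hits / len(missing), abs_err / len(missing)


if __name__ == "__main__":
    # Hypothetical scenario: n=1000, 7:3 value distribution, 20% missing.
    truth = simulate_dichotomous(n=1000, p_one=0.7, seed=1)
    observed = mask_mcar(truth, missing_rate=0.2, seed=2)
    imputed = mode_impute(observed)
    acc, mae = accuracy_and_mae(truth, observed, imputed)
    print(f"accuracy={acc:.3f}, MAE={mae:.3f}")
```

For a variable coded 0/1 with hard 0/1 imputations, each error contributes exactly 1 to the absolute error, so accuracy and MAE computed this way sum to 1.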

“…[11] Multiple imputation was applied for variables with less than 30% missing data.[12,13] Potential covariates included preoperative demographic variables routinely collected in the secure clinical repository. A total of 27 preoperative demographic variables were included age, body mass index, sex, race, smoking history, alcohol use history, hypertension, diabetes mellitus (type I or II), autoimmune conditions (systemic lupus erythematosus, etc.…”

Section: Methods (mentioning)

confidence: 99%

Development of Machine Learning Algorithms to Predict Being Lost to Follow-up After Hip Arthroscopy for Femoroacetabular Impingement Syndrome

Kunze, Burnett, Lee, et al.

2020

Arthroscopy, Sports Medicine, and Rehabilitation

Purpose: To determine factors predictive of patients who are at risk for being lost to follow-up after hip arthroscopy for femoroacetabular impingement syndrome (FAIS).
Methods: A prospective clinical repository was queried between January 2012 and October 2017 and all patients who underwent hip arthroscopy for primary or revision FAIS with minimum 2-year follow-up were included. A total of 27 potential risk factors for loss to follow-up were available and tested for predictive value. An 80:20 random sample split of all patients was performed to create training and testing sets. Cross-validation, minimum Bayes information criteria, and adaptive machine-learning algorithms were used to develop the predictive model. The model with the best predictive performance was selected based off of the lowest postestimation deviance between the training and testing samples. The c-statistic is a measure of discrimination. It ranges from 0.5 to 1.0, with 1.0 being perfect discrimination and 0.5 indicating the model is no better than chance. A log-likelihood χ² test was used to evaluate the goodness-of-fit of the logistic regression model.
Results: A total of 2113 patients were included. Inference of minimum Bayes information criteria model indicated that male sex (odds ratio [OR] 1.82, P = .028), non-white race (African American OR 2.41, P = .013; other non-white OR 1.42, P = .042), smoking (OR 1.07, P = .021), and failure to provide a phone number (OR 1.78, P = .032) increased the risk for being lost to follow-up. Furthermore, greater preoperative International Hip Outcome Tool 12-item component questionnaire (OR 1.03, P = .004), and modified Harris Hip Score (OR 1.05, P = .014) scores increased the risk of being lost to follow-up. The c-statistic was 0.76 (95% confidence interval 0.701-0.848). The log-likelihood indicated that the regression model as a whole was statistically significant (P = .002).
Conclusions: Patients who are male, non-white, smokers, fail to provide a telephone number, and have greater preoperative modified Harris Hip Score and International Hip Outcome Tool 12-item component questionnaire scores are at an increased risk for being lost to follow-up 2 years after hip arthroscopy for FAIS.
Level of Evidence: Level III, case control study.
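The c-statistic mentioned in the abstract above can be read as the share of (event, non-event) pairs in which the model gives the event the higher predicted risk, with ties counted as half. The sketch below computes it that way by brute force. The function name and the toy labels and scores are assumptions for illustration and are not data from the study. A model that ranks cases worse than chance would score below 0.5 under this definition, which is why 0.5 is usually treated as the practical floor.

```python
def c_statistic(labels, scores):
    """Pairwise concordance between binary labels (0/1) and predicted scores.

    Counts, over every (positive, negative) pair, how often the positive
    case has the higher score; tied scores count as half a concordant pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical labels (1 = lost to follow-up) and predicted risks.
    labels = [0, 0, 1, 1]
    scores = [0.10, 0.40, 0.35, 0.80]
    print(c_statistic(labels, scores))
```

The double loop is quadratic in the number of cases, which is fine for a sketch; a rank-based formula gives the same value more efficiently on large samples.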

“…Methods of imputations were as follows: predictive mean matching for continuous variables, logistic regression imputation for dichotomous variables, and polytomous regression imputation for categorical variables with more than 2 levels. [24][25][26] Multiple imputation processes were repeated for all variables (1 predictor, 10 outcomes, and 23 covariates) five times to generate five complete, imputed data sets using the "mice" function of the "mice" package [24] in R software. [27] Estimates were calculated by pooling the five sets of results from logistic regression models of five imputed data sets, using "pool" function of the "mice" package.…”

Section: Multiple Imputation Process and Analyses (mentioning)

confidence: 99%
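The pooling step in the statement above, where results from five imputed data sets are combined into one estimate, follows Rubin's rules. The sketch below shows that arithmetic in Python rather than R. The function name `pool_rubin` is hypothetical, the coefficients and variances are made-up illustrative numbers rather than results from the cited study, and the degrees-of-freedom calculation that accompanies pooled inference is omitted.

```python
import math


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their squared standard errors.

    Returns the pooled estimate and its standard error, where the total
    variance is the within-imputation variance plus the between-imputation
    variance inflated by (1 + 1/m).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    q_bar = sum(estimates) / m                                  # pooled estimate
    within = sum(variances) / m                                 # mean sampling variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


if __name__ == "__main__":
    # Hypothetical log-odds coefficients from five imputed data sets.
    log_odds = [0.10, 0.12, 0.14, 0.11, 0.13]
    variances = [0.0004] * 5
    estimate, se = pool_rubin(log_odds, variances)
    # Odds ratios are pooled on the log scale and exponentiated afterwards.
    print(math.exp(estimate), se)
```

Pooling on the log-odds scale and exponentiating only at the end keeps the averaged quantity approximately normal, which is the assumption the rules rely on.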

Weekend delivery and maternal–neonatal adverse outcomes in low‐risk pregnancies in the United States: A population‐based analysis of 3‐million live births

Kim

Selya

2022

Birth

Background: Childbirth is the most common cause of hospital admission in the United States. Previous studies have shown that there might be a “weekend effect” in perinatal care, indicating that mothers and newborns whose deliveries occur during the weekends are at increased risk of having adverse outcomes. This study aims to isolate the association between the weekend delivery and maternal–neonatal adverse outcomes by investigating low‐risk pregnancies in nationwide data.
Methods: A population‐based study of all low‐risk pregnancies (in‐hospital, nonanomalous, term, normal birthweight, and singleton) was conducted based on US national natality data in 2017. Four maternal outcomes (ICU admission, uterine rupture, blood transfusion, and perineal laceration) and three neonatal outcomes (5‐minute Apgar <7, NICU admission, and neonatal death) were defined as adverse outcomes. Logistic regression analyses were conducted to determine the association, adjusting for 23 maternal and neonatal characteristics and risk factors.
Results: Among 3 011 577 low‐risk pregnancies, 6.0% were reported to have at least one of the maternal–neonatal adverse outcomes. Weekend deliveries were significantly associated with six maternal–neonatal adverse outcomes with an exception of neonatal death. In general, weekend deliveries were 1.13 times significantly as likely to have any of seven maternal–neonatal adverse outcomes than weekday deliveries (OR 1.13, 95% CI 1.11‐1.14), being attributed to adverse outcomes of more than 4500 mother–newborn pairs.
Conclusions: Weekend delivery is a consistent risk factor for both mothers and babies at the national level. Furthermore, studies are needed about possible modifiable factors that mediate these associations to ensure safe childbirth regardless of the day of delivery.
