Comparison of Performance of Data Imputation Methods for Numeric Dataset

Jadhav, Anil; Pramod, Dhanya; Ramanathan, Krishnan

doi:10.1080/08839514.2019.1637138

Cited by 245 publications

(138 citation statements)

References 33 publications

Supporting

Mentioning

121

Contrasting

Unclassified


“…The data analysis pipeline was implemented in Python (version 3.7), using the numpy (version 1.19), pandas (version 1.1) and scikit-learn (version 0.23) libraries. For imputation, the multivariate k-nearest neighbors algorithm was used [26], with k=5. For feature-selection, the recursive feature-elimination algorithm was used [27].…”

Section: Machine Learning Experimental Design (mentioning)

confidence: 99%

Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests

Cabitza

Campagner

Resta

et al. 2020

Clinical Chemistry and Laboratory Medicine (CCLM)


Objectives: The rRT-PCR test, the current gold standard for the detection of coronavirus disease (COVID-19), presents with known shortcomings, such as long turnaround time, potential shortage of reagents, false-negative rates around 15–20%, and expensive equipment. The hematochemical values of routine blood exams could represent a faster and less expensive alternative. Methods: Three different training data sets of hematochemical values from 1,624 patients (52% COVID-19 positive), admitted at San Raphael Hospital (OSR) from February to May 2020, were used for developing machine learning (ML) models: the complete OSR dataset (72 features: complete blood count (CBC), biochemical, coagulation, hemogasanalysis and CO-Oxymetry values, age, sex and specific symptoms at triage) and two sub-datasets (COVID-specific and CBC dataset, 32 and 21 features respectively). 58 cases (50% COVID-19 positive) from another hospital, and 54 negative patients collected in 2018 at OSR, were used for internal-external and external validation. Results: We developed five ML models: for the complete OSR dataset, the area under the receiver operating characteristic curve (AUC) for the algorithms ranged from 0.83 to 0.90; for the COVID-specific dataset from 0.83 to 0.87; and for the CBC dataset from 0.74 to 0.86. The validations also achieved good results: respectively, AUC from 0.75 to 0.78; and specificity from 0.92 to 0.96. Conclusions: ML can be applied to blood tests as both an adjunct and alternative method to rRT-PCR for the fast and cost-effective identification of COVID-19-positive patients. This is especially useful in developing countries, or in countries facing an increase in contagions.
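The citation statement for this entry describes k-nearest-neighbors imputation with k=5 followed by recursive feature elimination in scikit-learn. A minimal sketch of how such a pipeline might be assembled is given here. The synthetic data, the 10% missingness rate, the logistic-regression estimator, and the choice of five retained features are all assumptions made for illustration, not details reported by the cited study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in data: the study's blood-test features are not available here.
rng = np.random.default_rng(0)
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Knock out roughly 10% of the entries at random to simulate missing values.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

pipeline = Pipeline(
    [
        # k=5 as stated in the citation statement.
        ("impute", KNNImputer(n_neighbors=5)),
        # Estimator and number of features to keep are assumptions.
        ("select", RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)),
        ("classify", LogisticRegression(max_iter=1000)),
    ]
)
pipeline.fit(X_missing, y)

# Boolean mask over the 10 input columns marking the features RFE retained.
selected_mask = pipeline.named_steps["select"].support_
print("Selected feature indices:", np.flatnonzero(selected_mask))
```

Placing the imputer inside the pipeline means that, under cross-validation, neighbors are found only among the training rows of each fold, which avoids leaking information from held-out rows into the imputed values.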

“…The kNN algorithm is increasingly used to impute missing data in research with high volume data such as genetics and metabolomics studies [22,23]. In several recent reports the kNN algorithm was shown to produce the smallest imputation error compared to methods such as mean and median imputation, Bayesian linear regression, K-Means, K-Medoids clustering algorithms [24,25]. However, some studies reported that simpler methods such as mean or median replacement were as adequate as methods like kNN when imputation was followed by clustering of genetic data [26].…”

Section: Discussion (mentioning)

confidence: 99%

The K nearest neighbor algorithm for imputation of missing longitudinal prenatal alcohol data

Sania

Pini

Nelson

et al. 2020

Preprint


Background — Missing data are a source of bias in many epidemiologic studies. This is problematic in alcohol research where data missingness may not be random as they depend on patterns of drinking behavior. Methods — The Safe Passage Study was a prospective investigation of prenatal alcohol consumption and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing exposure data using a machine learning algorithm; “K Nearest Neighbor” (K-NN). K-NN imputes missing values for a participant using data of other participants closest to it. Since participants with no missing days may not be comparable to those with missing data, segments from those with complete and incomplete data were included as a reference. Imputed values were weighted for the distances from nearest neighbors and matched for day of week. We validated our approach by randomly deleting non-missing data for 5-15 consecutive days. Results — We found that data from 5 nearest neighbors (i.e. K=5) and segments of 55 days provided imputed values with least imputation error. After deleting data segments from a first trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual. Conclusions — K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.
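The validation idea in this abstract (delete values that are actually known, impute them, and compare against the truth) can be sketched in a few lines. The example below is an illustration under stated assumptions: it uses scikit-learn's `KNNImputer` with K=5 and distance weighting, synthetic correlated data, and entries deleted at random rather than in consecutive segments. It does not reproduce the study's day-of-week matching or 55-day segments.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(42)

# Hypothetical data: six columns sharing one latent factor, so that
# rows resembling each other on observed columns also agree on hidden ones.
n_rows = 300
latent = rng.normal(size=(n_rows, 1))
X_true = latent + 0.3 * rng.normal(size=(n_rows, 6))

# Delete about 10% of the known entries and remember where they were.
deleted = rng.random(X_true.shape) < 0.1
X_observed = X_true.copy()
X_observed[deleted] = np.nan


def masked_rmse(imputer):
    """Root-mean-square error of an imputer on the deliberately deleted entries."""
    X_imputed = imputer.fit_transform(X_observed)
    return float(np.sqrt(np.mean((X_imputed[deleted] - X_true[deleted]) ** 2)))


# K=5 with inverse-distance weighting, mirroring the settings named in the abstract.
rmse_knn = masked_rmse(KNNImputer(n_neighbors=5, weights="distance"))
# Column-mean replacement as a simple baseline.
rmse_mean = masked_rmse(SimpleImputer(strategy="mean"))

print(f"kNN RMSE:  {rmse_knn:.3f}")
print(f"mean RMSE: {rmse_mean:.3f}")
```

Because the synthetic columns are strongly correlated, kNN should recover the deleted values more closely than the column mean here; on weakly correlated data the gap can shrink or vanish, which is consistent with the mixed findings quoted in the citation statement above.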


“…But for a small percentage of missingness, imputation using the k-nearest neighbour algorithm could be used, which are more accurate than using mean/median values [13]. With the introduction of newer medications, the model performance might be affected. This limitation needs to be assessed and necessary changes in covariates should be updated to ensure a good performance of the model.…”

Section: Advantages and Challenges (mentioning)

confidence: 99%