2021
DOI: 10.3389/fdata.2021.693674

A Benchmark for Data Imputation Methods

Abstract: With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications, too, high data quality standards are crucial to ensure robust predictive performance and the responsible use of automated decision making. One of the most frequent data quality problems is missing values…

Cited by 82 publications (51 citation statements)
References: 33 publications
“…A total of 16 blood biomarkers (i.e., total cholesterol, triglyceride, glycated hemoglobin, urea, creatinine, high-sensitivity C-reactive protein, platelet count, white blood cell count, mean corpuscular volume, glucose, high-density lipoprotein, low-density lipoprotein, hemoglobin, cystatin, uric acid, and hematocrit) were measured in the 2011/2012 wave of CHARLS (24), plus systolic and diastolic blood pressure, and pulse, resulting in 19 candidate biomarkers for initial consideration in this study. We first imputed the missing data with the mean and normalized the data using a min-max scaler, because data imputation and normalization are necessary steps in an ML pipeline (26, 27). Imputing missing values contributed to improved predictive power regardless of the conditions of missingness (26).…”
Section: Methods (citation type: mentioning; confidence: 99%)
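The pipeline this snippet describes (mean imputation followed by min-max normalization) maps directly onto standard library calls. Below is a minimal scikit-learn sketch; the array contents and feature order are hypothetical, not data from the CHARLS study.

```python
# Mean imputation followed by min-max scaling, as described in the
# citation above. The data here are invented for illustration.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical biomarker matrix with missing entries (NaN).
X = np.array([
    [5.2, np.nan, 120.0],
    [4.8, 6.1, np.nan],
    [np.nan, 5.4, 135.0],
])

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill NaNs with column means
    ("scale", MinMaxScaler()),                   # rescale each feature to [0, 1]
])

X_clean = preprocess.fit_transform(X)
print(X_clean)
```

Fitting the imputer and scaler on training data only, and reusing the fitted pipeline on test data, avoids leaking test-set statistics into the model.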
“…We first imputed the missing data with the mean and normalized the data using a min-max scaler, because data imputation and normalization are necessary steps in an ML pipeline (26, 27). Imputing missing values contributed to improved predictive power regardless of the conditions of missingness (26). Training models with normalized data usually helps to enhance performance, so data normalization is an essential step in ML as well (27).…”
Section: Methods (citation type: mentioning; confidence: 99%)
“…Those data were further used to (1) draw a comprehensive map of the association between estimated renal metabolism at reperfusion and one-year eGFR, and (2) predict the renal graft function at one year. This outcome was selected because renal graft function one year after transplantation has been widely identified as a major factor associated with graft survival [34][35][36][37][38][39][40][41]. Using multivariable analyses, two studies have also shown that estimated one-year GFR was the best predictor of long-term renal graft survival [10,11].…”
Section: Discussion (citation type: mentioning; confidence: 99%)
“…As a second model, we adopted a more sophisticated technique. As missing data were present for three variables (donor serum creatinine, donor age and warm ischemia time), we first imputed them using bagged tree imputation [35]. The distribution of those variables, before and after imputation, is reported in Supplementary Figure S4.…”
Section: Approach Outperforms Classic Statistical Methods (citation type: mentioning; confidence: 99%)
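The bagged tree imputation cited here is caret's bagImpute routine in R. As a rough Python analog (an assumption, not the authors' code), scikit-learn's IterativeImputer can be driven by a bagging ensemble of decision trees, modelling each incomplete variable from the others. The variable names follow the quote; the values are invented.

```python
# Bagged-tree imputation sketch: each column with missing values is
# predicted from the other columns by a bagging ensemble of trees.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical donor records: serum creatinine, donor age, warm ischemia time.
X = np.array([
    [1.1, 54.0, np.nan],
    [np.nan, 61.0, 32.0],
    [0.9, np.nan, 28.0],
    [1.4, 47.0, 35.0],
])

imputer = IterativeImputer(
    estimator=BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Comparing variable distributions before and after imputation, as the authors do in their Supplementary Figure S4, is a sensible sanity check that the imputed values are plausible.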
“…Numerical data were centred and scaled, and underwent a Yeo-Johnson transformation. Missing data were then imputed using bagged tree imputation [18]. This step was completed using the caret package.…”
Section: Data Pre-processing (citation type: mentioning; confidence: 99%)
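The centring, scaling, and Yeo-Johnson steps quoted above are carried out with R's caret package. For readers working in Python, scikit-learn's PowerTransformer performs the equivalent transforms in one call (a sketch under that assumption, not the authors' script): it fits a Yeo-Johnson transform per feature and, with standardize=True, also centres and scales the output to zero mean and unit variance.

```python
# Yeo-Johnson transform with centring and scaling, analogous to the
# caret pre-processing quoted above. The input values are invented.
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[0.5], [1.2], [3.8], [9.4], [27.1]])  # a skewed numeric feature

pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_transformed = pt.fit_transform(X)

print(pt.lambdas_)    # fitted Yeo-Johnson lambda for each feature
print(X_transformed)  # centred, scaled, transformed values
```

Unlike the Box-Cox transform, Yeo-Johnson is defined for zero and negative inputs, which makes it a safer default for clinical variables that can take small or negative values.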