2021
DOI: 10.3390/agriculture11080727

A Comparative Study of Various Methods for Handling Missing Data in UNSODA

Abstract: UNSODA, a free international soil database, is very popular and has been used in many fields. However, missing soil property data have limited the utility of this dataset, especially for data-driven models. Here, three machine learning-based methods, i.e., random forest (RF) regression, support vector (SVR) regression, and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean and multiple imputation (MI), were used to impute the missing soil property data, including pH, satur…
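
As a rough illustration of the simplest method named in the abstract, mean imputation, the Python sketch below hides some known values, fills them with the mean of the remaining observations, and scores the fill by root-mean-square error. The pH values, the hidden indices, and the function names are hypothetical and are not taken from UNSODA or from the paper; using RMSE as the score is also an assumption made for this example.

```python
import math
from statistics import mean


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def rmse(true_values, predicted_values):
    """Root-mean-square error between two equal-length sequences."""
    if len(true_values) != len(predicted_values):
        raise ValueError("sequences must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical pH column (illustrative numbers, not UNSODA data).
ph_true = [6.0, 7.0, 8.0, 5.0, 7.0]

# Hide two known entries to simulate missing data.
hidden = [1, 3]
ph_masked = [None if i in hidden else v for i, v in enumerate(ph_true)]

# Fill the gaps, then score only the entries that were hidden.
ph_imputed = mean_impute(ph_masked)
error = rmse([ph_true[i] for i in hidden], [ph_imputed[i] for i in hidden])
```

Scoring only the hidden entries keeps the error from being diluted by values that were never missing. The same masking-and-scoring loop could wrap a regression-based imputer in place of `mean_impute`.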

Cited by 10 publications (6 citation statements)
References 61 publications (65 reference statements)
“…The number of fields sampled per farm were not uniform throughout, and the managerial variables were inconsistently recorded each year. As is common practice within this research area (Fu et al., 2021), data with complete sets of values were used for statistical analysis, carried out using R 4.0.2 (R Core Team, 2021). The dataset finally used for management intensity analysis therefore included 16 farms, 43 fields and a total of 188 observations across the 10‐year timespan (Table S1).…”
Section: Methods (mentioning)
confidence: 99%
“…The dataset is subsequently processed to optimize the performance of the machine learning model. The initial stage of data processing involves removing all null or NaN data, a process commonly referred to as "handling missing data" [26]. The next step is converting categorical data into numerical data.…”
Section: Methods 2.1 Data Preparation (mentioning)
confidence: 99%
“…For instance, the K-nearest neighbor is widely used because of its simplicity and high performance. However, K-NN performs poorly in large datasets and high-dimensional data contexts [47]. Additionally, many researchers stumble while linking different features in a dataset.…”
Section: RQ4 What Are the Measurement Factors Used to Evaluate the ML... (mentioning)
confidence: 99%
“…Hence, clinical researchers should be aware not to exceed this threshold while simulating missing values. Furthermore, outliers can also affect the imputation performance by causing a large RMSE if not handled correctly by deleting or replacing them [47].…”
Section: E. RQ5 What Are the Limitations and Strength Points in Applyi... (mentioning)
confidence: 99%