A Comparative Study of Various Methods for Handling Missing Data in UNSODA

Fu, Yingpeng; Liao, Hongjian; Lv, Longlong

doi:10.3390/agriculture11080727

Cited by 10 publications

(6 citation statements)

References 61 publications

(65 reference statements)


“…The number of fields sampled per farm were not uniform throughout, and the managerial variables were inconsistently recorded each year. As is common practice within this research area (Fu et al., 2021), data with complete sets of values were used for statistical analysis, carried out using R 4.0.2 (R Core Team, 2021). The dataset finally used for management intensity analysis therefore included 16 farms, 43 fields and a total of 188 observations across the 10‐year timespan (Table S1).…”

Section: Methods (mentioning)

confidence: 99%

Long‐term effects of management intensity and bioclimatic variables on leatherjacket (Tipula paludosa Meigen) populations at farm scale

Moffat, Cole, Lacey et al. 2024

J Applied Entomology


Leatherjackets (Tipula spp.) are soil‐dwelling pests associated with agriculture. Land management decisions made at farm scale can have subsequent effects on their populations. Between 1980 and 2020, surveys were conducted across Scotland to collect field histories and larval population data from grassland farms. To assess the impact of management and bioclimatic factors on leatherjacket occurrence over time, this study investigated data from fields continuously sampled between 2009 and 2018. We utilized a Generalized Linear Mixed‐Effect Model on a dataset of 61 fields on 19 farms. Results indicated three significant factors affecting larval populations: field size, grazing type and application of insecticides or herbicides (referred to collectively as pesticides). Larval populations were significantly lower in fields that were larger in size and under sheep grazing, compared to no grazing. Pesticide application also caused a significant reduction in larval populations. Management variables were amalgamated to create a Management Intensity Index, revealing significantly increased larval populations under low‐management systems. These results, coupled with significant effects of bioclimatic variables, pinpoint predictive signals for high infestations and potential routes for control strategies.


“…The dataset is subsequently processed to optimize the performance of the machine learning model. The initial stage of data processing involves removing all null or NaN data, a process commonly referred to as "handling missing data" [26]. The next step is converting categorical data into numerical data.…”

Section: Methods 21 Data Preparation (mentioning)

confidence: 99%

Comparison of Classification and Regression Model Approaches on the Main Causes of Stroke with Symbolic Regression Feyn Qlattice

Purwono, Agung Budi Prasetio, Burhanuddin bin Mohd Aboobaider 2023

JAHIR


Stroke is one of the deadliest diseases in the world, caused by damage to brain tissue resulting from a blockage in the cerebrovascular system. Proper treatment is essential to avoid worsening complications in patients. Several main triggering factors for stroke include hypertension, obesity, smoking habits, lack of physical activity, excessive alcohol consumption, diabetes, and high cholesterol levels. The advancement of information technology allows for early disease prediction through the utilization of AI and Machine Learning technology. The vast amount of data available on medical and health services worldwide can be maximized to identify risk factors for various diseases, including stroke. Machine learning techniques can be employed to predict the causes of stroke. In this study, we were inspired to use the Feyn Qlattice model approach to address stroke. Both classification and regression models were tested in this study. The results indicate that the classification model performs better, achieving an accuracy rate of 0.95. In contrast, the regression model yielded less satisfactory results, with R2, MAE, and RMSE values considered inadequate. This conclusion is supported by the regression plot and residual plot, both of which indicate suboptimal performance. Hence, maximizing the use of the Feyn Qlattice regression model in datasets related to the causes of stroke is recommended.


“…For instance, the K-nearest neighbor is widely used because of its simplicity and high performance. However, K-NN performs poorly in large datasets and high-dimensional data contexts [47]. Additionally, many researchers stumble while linking different features in a dataset.…”

Section: Rq4 What Are the Measurement Factors Used To Evaluate The Ml... (mentioning)

confidence: 99%

“…Hence, clinical researchers should be aware not to exceed this threshold while simulating missing values. Furthermore, outliers can also affect the imputation performance by causing a large RMSE if not handled correctly by deleting or replacing them [47].…”

Section: E Rq5 What Are the Limitations And Strength Points In Applyi... (mentioning)

confidence: 99%

“…Another important factor is the dataset, which is specified by its size, type, and level of correlation between variables. For instance, deep learning techniques work better with large datasets, while support vector machines and K-NN achieve high performance in small datasets [47]. Different datasets have different data types, such as time-series datasets, which work better with recurrent neural networks [108].…”

Section: ) Taxonomy Of ML In Imputing Missing Values (mentioning)

confidence: 99%


Systematic Review of Using Machine Learning in Imputing Missing Values

et al. 2022


Missing data are a universal data quality problem in many domains, leading to misleading analysis and inaccurate decisions. Much research has been done to investigate the different mechanisms of missing data and the proper techniques in handling various data types. In the last decade, machine learning has been utilized to replace conventional methods to address the problem of missing values more efficiently. By studying and analyzing recently proposed methods using machine learning approaches, vital adoptions in accuracy, performance, and time consumed can be highlighted. This study aimed to help data analysts and researchers address the limitations of machine learning imputation methods by conducting a systematic literature review to provide a comprehensive overview of using such methods to impute missing values. Novel proposed machine learning approaches used for data imputation are analyzed and summarized to assist researchers in selecting a proper machine learning method based on several factors and settings. The review was performed on research studies published between 2016 and 2021 on adopting machine learning to impute missing values, focusing on their strengths and limitations. A total of 684 research articles from various scientific databases were analyzed using search engines, and 94 of them were selected as primary studies. Finally, several recommendations were given to guide future researchers in applying machine learning to impute missing values. INDEX TERMS: Systematic literature review, data imputation, data mining, missingness, data preprocessing, data quality.
