2022
DOI: 10.1002/widm.1456

Gaining insights in datasets in the shade of “garbage in, garbage out” rationale: Feature space distribution fitting

Abstract: This article emphasizes comprehending the “Garbage In, Garbage Out” (GIGO) rationale and ensuring dataset quality in Machine Learning (ML) applications to achieve high and generalizable performance. An initial step should be added to an ML workflow in which researchers evaluate the insights gained by quantitative analysis of the datasets' sample and feature spaces. This study contributes towards achieving such a goal by suggesting a technique to quantify datasets in terms of feature frequency distribution char…
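The abstract describes quantifying a dataset by the frequency distributions of its features. The paper's own technique is not spelled out here, so the following is only a minimal, generic sketch of the idea: fit a candidate distribution (here, a moment-matched normal) to each feature and score the fit with a one-sample Kolmogorov–Smirnov distance, so that poorly fitting (e.g., skewed) features stand out. The helpers `ks_statistic` and `fit_normal` and the synthetic features are illustrative assumptions, not the authors' method.

```python
import math
import numpy as np

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov distance between the empirical
    CDF of `sample` and a candidate CDF (both evaluated on sorted data)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    f = cdf(x)
    upper = np.arange(1, n + 1) / n - f   # ECDF just above each data point
    lower = f - np.arange(0, n) / n       # ECDF just below each data point
    return float(max(upper.max(), lower.max()))

def fit_normal(sample):
    """Fit a normal distribution by moments; return (mu, sigma, cdf)."""
    mu = float(np.mean(sample))
    sigma = float(np.std(sample))
    cdf = lambda x: 0.5 * (1.0 + np.array(
        [math.erf((v - mu) / (sigma * math.sqrt(2.0)))
         for v in np.atleast_1d(x)]))
    return mu, sigma, cdf

# Two synthetic features: one genuinely Gaussian, one right-skewed.
rng = np.random.default_rng(42)
gaussian_feature = rng.normal(loc=5.0, scale=2.0, size=1000)
skewed_feature = rng.exponential(scale=2.0, size=1000)

_, _, cdf_g = fit_normal(gaussian_feature)
_, _, cdf_s = fit_normal(skewed_feature)
d_gaussian = ks_statistic(gaussian_feature, cdf_g)
d_skewed = ks_statistic(skewed_feature, cdf_s)

print(f"KS distance, Gaussian feature vs fitted normal: {d_gaussian:.3f}")
print(f"KS distance, skewed feature vs fitted normal:  {d_skewed:.3f}")
```

A larger KS distance flags a feature whose empirical distribution departs from the assumed family, which is one concrete way a GIGO-style data audit could rank features for closer inspection before training.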

Cited by 7 publications (7 citation statements) · References: 58 publications
“…Currently, the most widely used dataset is the publicly available Pima Indian dataset, which includes 768 data results with nine feature variables. Many studies (18)(19)(20)(21)(22)(23)(24)(25)(26) have established … Available online at: https://www.kaggle.com/datasets/andrewmvd/early-diabetes-classification.…”
Section: Introduction
confidence: 99%
“…decision makers, environmental consultants and conservation managers) are often limited because of data sensitivity or ownership issues, although more and more programs contain data that are publicly available and use of them can be made without any particular attention to their quality (Costello and Wieczorek, 2014; Tittensor et al, 2014) and they are generally unfamiliar to SEA stakeholders. Surprisingly however, despite the prevailing recognition of the “garbage in – garbage out” that emphasises the critical importance of the quality of data (Sanders and Saxe, 2017; Canbek, 2022), an examination of data suitability is relatively rare in local conservation planning (Rondinini et al, 2006; Hermoso et al, 2015a). In this context, some authors argue the necessity of examining the sensitivity of model results to the nature of the datasets that are used (Sanders and Saxe, 2017; Clare et al, 2019; Velazco et al, 2020).…”
Section: Introduction
confidence: 99%
“…On the other hand, it has been found that even the best model can be tricked by poor data quality. 19–22 For example, in malware detection it was found that ML-based models can fail if the training data does not contain the event the model had been designed for. 19,21 The notion of underperforming models trained on low-quality data (“garbage in-garbage out”) can be traced back to Charles Babbage.…”
Section: Introduction
confidence: 99%
“…The ML community is starting to notice the importance of data quality used for training and the relevance to balance amount of data (“big data”) versus quality of data.…”
Section: Introduction
confidence: 99%