2021
DOI: 10.1109/access.2021.3110745

Algorithmic Splitting: A Method for Dataset Preparation

Abstract: Datasets that appear in publications are curated and split into training, testing, and validation sub-datasets by domain experts. Consequently, machine learning models typically perform well on such split-by-hand datasets, whereas preparing real-world datasets into curated splits, i.e., training, testing, and validation sub-datasets, requires extensive effort. Usually, random repetitive splitting is carried out and evaluated until a better score is reached on the evaluation metrics. In this paper, a…
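The "random repetitive splitting" the abstract criticizes can be sketched as follows. This is a minimal illustration, not the paper's algorithm: synthetic data, a least-squares model, and the seed loop are all assumptions made for the example.

```python
# Illustrative sketch of random repetitive splitting: try several random
# train/test splits and keep whichever yields the best evaluation score.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # synthetic features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

def fit_and_score(X_tr, y_tr, X_te, y_te):
    # Ordinary least squares; the score is negative test-set MAE.
    w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return -np.mean(np.abs(X_te @ w - y_te))

best_score, best_seed = -np.inf, None
for seed in range(10):                              # repeat the split
    idx = np.random.default_rng(seed).permutation(len(X))
    cut = int(0.8 * len(X))                         # 80/20 split ratio
    tr, te = idx[:cut], idx[cut:]
    score = fit_and_score(X[tr], y[tr], X[te], y[te])
    if score > best_score:                          # keep the "best" split
        best_score, best_seed = score, seed
```

The point of the sketch is that the retained split is selected for its score, which is exactly the hand-tuning effort the paper's algorithmic splitting aims to replace.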

Cited by 31 publications (12 citation statements); references 16 publications.
“…Therefore, relying on a single dataset split to create a model may pose challenges to the model’s representativeness and reliability. The performance of models built on small datasets is strongly affected by dataset size, split ratio, and split strategy, and machine learning may not capture the full range of features and patterns present in the given training set. This underscores the importance of evaluating model performance, and interpreting what a model learned, across various splits.…”
Section: Results
confidence: 99%
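The sensitivity to split choice described in this excerpt can be demonstrated directly. The sketch below is illustrative only (synthetic small dataset, least-squares model assumed): it measures how test MAE varies across many random splits of the same data.

```python
# Illustration: on a small dataset, test MAE varies noticeably with the
# random split, motivating evaluation across multiple splits.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 2))                        # deliberately small
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=60)

maes = []
for seed in range(30):                              # 30 different splits
    idx = np.random.default_rng(seed).permutation(len(X))
    cut = int(0.8 * len(X))
    tr, te = idx[:cut], idx[cut:]
    w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
    maes.append(np.mean(np.abs(X[te] @ w - y[te])))

spread = np.std(maes)                               # split-to-split spread
```

Reporting the mean and standard deviation of `maes`, rather than a single split's score, gives the more reliable picture the excerpt calls for.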
“…The practice of dimensionality reduction followed by clustering is common for large input data and has been applied to SAR data sets (Van de Kerkhof et al., 2020), and for a wide range of other data types (Fernández Llamas et al., 2019; R. Harrison et al., 2019; Kahloot & Ekler, 2019). T‐SNE is a dimensionality reduction method that can group similarly behaving time series of height measurements of the different reflection points (Van der Maaten & Hinton, 2008).…”
Section: Methods
confidence: 99%
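The dimensionality-reduction-then-clustering pipeline this excerpt describes can be sketched with scikit-learn. The data, perplexity, and cluster count below are illustrative assumptions, not values from the cited SAR studies.

```python
# Sketch: t-SNE embeds short time series in 2-D, then k-means groups the
# embedding — the reduce-then-cluster pattern described in the excerpt.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups of similarly behaving "time series" (rows).
group_a = rng.normal(loc=0.0, size=(15, 10))
group_b = rng.normal(loc=5.0, size=(15, 10))
X = np.vstack([group_a, group_b])

# Perplexity must be smaller than the number of samples.
emb = TSNE(n_components=2, perplexity=5, init="random",
           random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
```

Clustering in the 2-D embedding rather than the raw 10-D space is the design choice the excerpt highlights: t-SNE places similarly behaving series near each other, which makes the subsequent clustering step cheap and interpretable.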
“…To determine the quality of the trained model, we use the held-out validation datasets. Validation reveals a great deal about the model’s performance [40]. Absolute error and mean absolute error are two metrics used to measure the quality of the model, which allows for testing and comparison of multiple models [41,42].…”
Section: Model
confidence: 99%
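The two metrics named in this excerpt are straightforward to compute on a held-out set. The numbers below are illustrative, not from the cited work.

```python
# Held-out validation with absolute error and mean absolute error (MAE).
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # held-out validation targets
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

abs_err = np.abs(y_pred - y_true)           # per-sample absolute error
mae = abs_err.mean()                        # mean absolute error: 0.5 here
```

Because MAE reduces each model's validation performance to a single number in the target's units, it supports the model-to-model comparison the excerpt mentions.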