2021
DOI: 10.3390/app11188546

Selecting the Suitable Resampling Strategy for Imbalanced Data Classification Regarding Dataset Properties. An Approach Based on Association Models

Abstract: In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Thus, the prediction model is unreliable although the overall model accuracy can be acceptable. Oversampling and undersampling techniques are well-known strategies to deal…

Cited by 24 publications (13 citation statements)
References 68 publications (88 reference statements)

“…The selection of the best re-sampling technique is complicated. Since the effectiveness of re-sampling techniques depends on intrinsic properties of the dataset, such as dataset size and dimensionality, imbalance ratio, overlapping between classes or borderline samples (41). In the present study, the majority class and minority class have close properties, such as both of them are clinical T1 stage lung adenocarcinoma.…”
Section: Discussion
Citation type: mentioning
confidence: 72%
“…Imbalanced dataset is a common problem in machine learning classification. This imbalanced data can prevent the machine learning algorithms from building accurate models for these minority classes and lead to prediction errors [25,34]. For example, SimpleLogistic worked better than decision tree with sampling methods for the datasets of obstetrics and gynecology and urology, but not for neurosurgery dataset.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…For example, SimpleLogistic worked better than decision tree with sampling methods for the datasets of obstetrics and gynecology and urology, but not for neurosurgery dataset. There are several methods to solve this problem of imbalanced data, such as resampling the datasets by under-sampling the majority class and over-sampling the minority class, modifying algorithms, and considering a different perspective, such as anomaly [24,25,34]. We used two resampling approaches (Bagging and AdaBoost) to overcome the problem of imbalanced dataset.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…One of them is the health area since it is a rich data source, including electronic medical records, administrative reports, and medical imaging among others (11,12). There are numerous studies in the literature, including from our group (13-16), in which different machine learning algorithms have been used for various purposes such as the automation of medical diagnosis, and prediction of mortality or treatment outcomes.…”
Section: Introduction
Citation type: mentioning
confidence: 99%