2022
DOI: 10.1109/access.2022.3169512

Stop Oversampling for Class Imbalance Learning: A Review

Abstract: For the last two decades, oversampling has been employed to overcome the challenge of learning from imbalanced datasets, and many approaches to this challenge have been offered in the literature. Oversampling itself, however, is a concern: models trained on fictitious data may fail spectacularly when applied to real-world problems. The fundamental difficulty with oversampling approaches is that, given a real-life population, the synthesized samples may not truly belong to the minority class. As a r…

Cited by 53 publications (26 citation statements)
References: 178 publications

“…The number of nearest neighbors is a hyper-parameter for SMOTE and is usually selected based on performance on a specific metric, e.g., model accuracy on the validation set. SMOTE, however, does not consider the underlying class distribution and is prone to over-fitting, class overlap, and noisy sample generation [14]. Although a number of potential improvements over SMOTE have been proposed (see [14] and references therein), SMOTE has over time become more popular among researchers and is seen as the default upsampling method in the field of imbalanced learning.…”
Section: Appendix A: Synthetic Minority Oversampling Technique
confidence: 99%
“…SMOTE, however, does not consider the underlying class distribution and is prone to over-fitting, class overlap, and noisy sample generation [14]. Although a number of potential improvements over SMOTE have been proposed (see [14] and references therein), SMOTE has over time become more popular among researchers and is seen as the default upsampling method in the field of imbalanced learning. Thus, we chose to include SMOTE as a benchmark method in our analysis.…”
Section: Appendix A: Synthetic Minority Oversampling Technique
confidence: 99%
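
The statements above describe SMOTE's nearest-neighbors hyper-parameter and its selection against a validation metric. As a minimal sketch of that workflow, assuming the imbalanced-learn and scikit-learn packages are available, the example below tunes k_neighbors by validation balanced accuracy; the synthetic dataset, the candidate k values, and the logistic-regression classifier are illustrative choices, not taken from the review or the citing papers.

```python
# Minimal sketch: tune SMOTE's k_neighbors hyper-parameter on a validation split.
# Assumes imbalanced-learn and scikit-learn; all dataset/model choices are illustrative.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (roughly 9:1 majority-to-minority ratio).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

best_k, best_score = None, -1.0
for k in (3, 5, 7):  # candidate numbers of nearest neighbors
    X_res, y_res = SMOTE(k_neighbors=k, random_state=0).fit_resample(X_train, y_train)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    score = balanced_accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_k, best_score = k, score

# Resample once more with the selected k and inspect the resulting class balance.
X_bal, y_bal = SMOTE(k_neighbors=best_k, random_state=0).fit_resample(X_train, y_train)
print("class counts after SMOTE:", Counter(y_bal))
print(f"best k_neighbors = {best_k}, validation balanced accuracy = {best_score:.3f}")
```

The same loop also exposes the limitation the review emphasizes: the selection criterion only measures validation performance, and says nothing about whether the interpolated samples plausibly belong to the minority class.
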
“…There are a number of AE variants in the literature [14], of which Variational Autoencoders (VAEs) [12] are the most popular. In a vanilla AE, the encoder outputs a single latent vector z directly; in a VAE, however, the encoder outputs two vectors: a mean vector µ and a variance vector σ.…”
Section: Appendix B: Variational Autoencoders
confidence: 99%
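
To make the contrast in this statement concrete, the sketch below implements a VAE-style encoder that outputs a mean vector µ and a log-variance vector (from which σ is recovered) instead of a single latent vector z. It assumes PyTorch; the class name, layer sizes, and latent dimension are illustrative, not drawn from the cited work.

```python
# Minimal sketch of a VAE encoder, assuming PyTorch; sizes and names are illustrative.
import torch
import torch.nn as nn


class VAEEncoder(nn.Module):
    """Maps an input x to a mean vector µ and a log-variance vector,
    instead of a single latent vector z as in a vanilla autoencoder."""

    def __init__(self, in_dim=64, latent_dim=8):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU())
        self.mu = nn.Linear(32, latent_dim)       # mean vector µ
        self.log_var = nn.Linear(32, latent_dim)  # log of the variance vector σ²

    def forward(self, x):
        h = self.hidden(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = µ + σ·ε with ε ~ N(0, I),
        # which keeps sampling differentiable with respect to µ and σ.
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return z, mu, log_var


encoder = VAEEncoder()
z, mu, log_var = encoder(torch.randn(4, 64))  # batch of 4 toy inputs
print(z.shape, mu.shape, log_var.shape)       # torch.Size([4, 8]) each
```
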