Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data

García, Vicente; Sánchez, J. Salvador; Marqués, A. I.; Florencia, Rogelio; Rivera, Gilberto

doi:10.1016/j.eswa.2019.113026

Cited by 69 publications

(28 citation statements)

References 94 publications

(98 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Over and under sampling strategies are very popular and effective approaches to deal with the class imbalance problem [21,25,33,50]. To compensate the class imbalance by biasing the process of discrimination, the ROS algorithm randomly replicates samples from the minority classes while the RUS technique randomly eliminates samples from the majority classes, until achieving a relative classes balance [23,60].…”

Section: Sampling Class Imbalance Approachesmentioning

confidence: 99%

“…Hybrid methods generally employ SMOTE to compensate the class imbalance, because this method reduces the possibilities of over-training or over-fitting [1]. They use methods based in nearest neighbor rule to reduce overlap or noise in the dataset [25].…”

Section: Hybrid Sampling Class Imbalance Strategiesmentioning

confidence: 99%

“…Recently, hybrid methods have increased in popularity in the machine learning community, as it can be observed in several references [23,25,66], which have studied SMOTE+TL, SMOTE+CNN and SMOTE+ENN, among others methods, to deal with class imbalance problem and to eliminate samples in the overlapping region, in order to improve the classifier performance.…”

Section: Hybrid Sampling Class Imbalance Strategiesmentioning

confidence: 99%

“…SMOTE+TL combines SMOTE and TL [64], SMOTE+CNN performs SMOTE and then CNN [65]. Figure 1a describes the operation of these hybrid approaches, which have been widely applied to deal with the class imbalance problem [22,25,29,33,50,[63][64][65]67]. The effectiveness of an additional hybrid approach is studied in this work, in which the training dataset is cleaned or reduced, and after, the number of samples by class in the training dataset is balanced.…”

Section: Hybrid Sampling Class Imbalance Strategiesmentioning

confidence: 99%

“…Nowadays, it is one the most popular data sampling methods [1] and it has motivated the development of other over-samplings algorithms [24]. Similarly, under-sampling methods have been incorporate a heuristic component [25], some of the most outstanding examples being the Tomek's Links (TL) [26], Editing Nearest Neighbor (ENN) [27], and Condensed Nearest Neighbor rule (CNN) [28], among others [22,[29][30][31][32][33].…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem

et al. 2020

View full text Add to dashboard Cite

The class imbalance problem has been a hot topic in the machine learning community in recent years. Nowadays, in the time of big data and deep learning, this problem remains in force. Much work has been performed to deal to the class imbalance problem, the random sampling methods (over and under sampling) being the most widely employed approaches. Moreover, sophisticated sampling methods have been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and also they have been combined with cleaning techniques such as Editing Nearest Neighbor or Tomek's Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, it is noticeable that the class imbalance problem has been addressed by adaptation of traditional techniques, relatively ignoring intelligent approaches. Thus, the capabilities and possibilities of heuristic sampling methods on deep learning neural networks in big data domain are analyzed in this work, and the cleaning strategies are particularly analyzed. This study is developed on big data, multi-class imbalanced datasets obtained from hyper-spectral remote sensing images. The effectiveness of a hybrid approach on these datasets is analyzed, in which the dataset is cleaned by SMOTE followed by the training of an Artificial Neural Network (ANN) with those data, while the neural network output noise is processed with ENN to eliminate output noise; after that, the ANN is trained again with the resultant dataset. Obtained results suggest that best classification outcome is achieved when the cleaning strategies are applied on an ANN output instead of input feature space only. Consequently, the need to consider the classifier's nature when the classical class imbalance approaches are adapted in deep learning and big data scenarios is clear.

show abstract

Section: Sampling Class Imbalance Approachesmentioning

confidence: 99%

Section: Hybrid Sampling Class Imbalance Strategiesmentioning

confidence: 99%

Section: Hybrid Sampling Class Imbalance Strategiesmentioning

confidence: 99%

Section: Hybrid Sampling Class Imbalance Strategiesmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem

et al. 2020

View full text Add to dashboard Cite

show abstract

Effects of classification, feature selection, and resampling methods on bankruptcy prediction of small and medium‐sized enterprises

Papíková

Papík

2022

Intell Sys Acc Fin Mgmt

View full text Add to dashboard Cite

Small and medium-sized enterprises are the pillars of an economy, and their poor performance has a negative impact on living standards of population and country development. This study analyzes real-life data of 89,851 small and medium-sized enterprises, out of which 295 have declared bankruptcy. The analysis is performed via 27 financial ratios. The study framework combines seven classifications and three resampling and seven feature selection methods. Out of all classification methods applied, CatBoost has achieved the best results for all combinations of resampling and feature selection methods. CatBoost surpassed the results of other classification methods for the area under curve parameter, achieving a value of 99.95%. The application of resampling methods on different classification models has not identified a statistically significant level of improvement in any of the resampling methods. This finding has also been observed for feature selection methods. Based on these findings, we assume that individual resampling and feature selection methods do not improve model performance compared with the original imbalanced sample's results.Our results suggest that, even though the data sample may be significantly imbalanced with a minority of bankrupt companies, most classification algorithms can handle this imbalance and achieve interesting results. Moreover, our findings provide broad practical application for all stakeholders who could need to detect bankrupting companies.

show abstract