Data imbalance is a thorny issue in machine learning. SMOTE is a widely used oversampling method for imbalanced learning, but it suffers from sample overlapping, noise interference, and blind neighbor selection. To address these problems, we present a new oversampling method, OS-CCD, based on a new concept, the classification contribution degree. The classification contribution degree determines how many synthetic samples SMOTE generates for each positive sample. OS-CCD follows the spatial distribution of the original samples along the class boundary while avoiding oversampling from noisy points. Experiments on twelve benchmark datasets demonstrate that OS-CCD outperforms six classical oversampling methods in terms of accuracy, F1-score, AUC, and ROC curves.
Imbalanced datasets are common in the real world, and their skewed class distributions degrade the performance of general machine learning models. To address the data-imbalance problem, a novel oversampling method based on the classification contribution degree, called OS-CCD, is presented. First, a new concept, the classification contribution degree, is established from micro- and macro-level information extracted from the raw dataset. Guided by this degree, OS-CCD lets positive samples that lie near the class boundary and in regions of high positive-sample density generate more synthetic samples than others. Furthermore, neighbor selection for oversampling is no longer random but governed by a selection probability. Experimental results on 12 benchmark datasets substantiate that four commonly used classifiers combined with the proposed oversampling method outperform those combined with six popular oversampling methods in terms of accuracy, F1-score, and AUC.
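The abstracts above do not spell out the exact formula for the classification contribution degree, so the following is only a minimal sketch of the stated idea, assuming the degree can be approximated by combining closeness to the class boundary (fraction of negative neighbors) with local positive-class density, and assuming SMOTE-style linear interpolation for synthesis. Function and parameter names (`oversample_ccd`, `k`, `n_new`) are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_ccd(X_pos, X_neg, n_new, k=5, rng=None):
    """SMOTE-style oversampling weighted by an assumed 'contribution' score.

    Sketch only: the classification contribution degree is approximated here
    by (a) boundary closeness (share of negative neighbours) times
    (b) local positive-class density. Assumes len(X_pos) > k.
    """
    rng = np.random.default_rng(rng)
    X_all = np.vstack([X_pos, X_neg])
    y_all = np.hstack([np.ones(len(X_pos)), np.zeros(len(X_neg))])

    # (a) boundary closeness: fraction of negatives among k nearest neighbours
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_all).kneighbors(X_pos)
    boundary = 1.0 - y_all[idx[:, 1:]].mean(axis=1)

    # (b) positive-class density: inverse mean distance to positive neighbours
    dist_pos, idx_pos = NearestNeighbors(n_neighbors=k + 1).fit(X_pos).kneighbors(X_pos)
    density = 1.0 / (dist_pos[:, 1:].mean(axis=1) + 1e-12)

    # assumed contribution degree, normalised into sampling weights
    ccd = boundary * density
    weights = ccd / ccd.sum() if ccd.sum() > 0 else np.full(len(X_pos), 1.0 / len(X_pos))

    # number of synthetic samples allotted to each positive sample
    counts = rng.multinomial(n_new, weights)

    synthetic = []
    for i, c in enumerate(counts):
        if c == 0:
            continue
        # neighbour selection is probabilistic (closer positives more likely)
        # rather than SMOTE's uniform random choice
        p = 1.0 / (dist_pos[i, 1:] + 1e-12)
        chosen = rng.choice(idx_pos[i, 1:], size=c, p=p / p.sum())
        gaps = rng.random((c, 1))
        synthetic.append(X_pos[i] + gaps * (X_pos[chosen] - X_pos[i]))

    return np.vstack(synthetic) if synthetic else np.empty((0, X_pos.shape[1]))
```

In this sketch, noisy positives (surrounded almost entirely by other positives far from the boundary, or isolated points with near-zero density) receive weights close to zero and therefore generate few or no synthetic samples, which mirrors the abstract's claim of avoiding oversampling from noisy points.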
Data-imbalance problems arise in many applications. A large gap between the numbers of samples in different classes biases classifiers toward the majority class and thus degrades learning performance and the quality of the results. Most data-level imbalanced-learning approaches generate new samples using only information from the minority class, either by linear interpolation or by fitting the minority distribution. In contrast, we propose a novel oversampling method based on generative adversarial networks (GANs), named OS-GAN. In this method, the GAN learns the distribution characteristics of the minority class from selected majority samples rather than from random noise, so the samples released by the trained generator carry information from both the majority and minority classes. Furthermore, a central regularization term keeps the distribution of the synthetic samples from being confined to the domain of the minority class, which can improve the generalization of learning models. Experimental results on 14 datasets and one high-dimensional dataset show that OS-GAN outperforms 14 commonly used resampling techniques in terms of G-mean, accuracy, and F1-score.
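The abstract gives the two distinguishing ingredients (majority samples as generator input, a central regularization) but not their exact form, so the sketch below is only an assumed reading of that description, not the authors' OS-GAN. The central regularization is assumed here to penalize the distance between the batch mean of generated samples and the minority-class centroid; network sizes, the weight `lam`, and the selection of majority inputs are all illustrative.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

def train_os_gan(X_min, X_maj_selected, epochs=200, lam=0.1, lr=1e-3):
    """X_min: minority samples (float tensor); X_maj_selected: majority samples
    chosen as generator inputs, e.g. those closest to the minority class."""
    dim = X_min.shape[1]
    G, D = Generator(dim), Discriminator(dim)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    center = X_min.mean(dim=0)                      # minority-class centroid

    for _ in range(epochs):
        # discriminator step: real minority samples vs. generated samples
        fake = G(X_maj_selected).detach()
        d_loss = bce(D(X_min), torch.ones(len(X_min), 1)) + \
                 bce(D(fake), torch.zeros(len(fake), 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # generator step: fool D, plus the assumed central regularization
        fake = G(X_maj_selected)
        g_loss = bce(D(fake), torch.ones(len(fake), 1)) \
                 + lam * torch.norm(fake.mean(dim=0) - center)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    # synthetic minority samples that carry information from both classes
    return G(X_maj_selected).detach()
```

Because the generator maps majority points rather than noise, the synthetic samples inherit structure from the majority region they started in, while the adversarial loss and the centroid term pull them toward the minority distribution without collapsing them onto it.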
In the age of big data, machine learning models are widely used for default risk prediction. Imbalanced datasets and redundant features are two main problems that reduce their performance. To address these issues, this study analyzes the effect of different balance ratios together with the order in which feature selection is applied. We first use data rebalancing and feature selection to obtain 32 derived datasets per original dataset, covering varying balance ratios and feature combinations. Second, we propose a comprehensive metric model based on multiple machine learning algorithms (CMM-MLA) to select the derived dataset with the optimal balance ratio and feature combination. Finally, a convolutional neural network (CNN) is trained on the selected derived dataset, and the approach is evaluated in terms of type-II error, accuracy, G-mean, and AUC. This study makes two contributions. First, the optimal balance ratio is found through classification accuracy, which addresses the limitation of existing research in which samples are either left imbalanced or rebalanced to a fixed 1:1 ratio, and thus safeguards the accuracy of the classification model. Second, a comprehensive metric model based on machine learning algorithms is proposed that can simultaneously find the best balance ratio and the optimal feature combination. The experimental results show that our method noticeably improves the performance of the CNN, and that the CNN outperforms four other commonly used machine learning models in the task of default risk prediction on four benchmark datasets.
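The abstract does not define the comprehensive metric or the exact rebalancing and feature-selection operators, so the following is a rough sketch of the search over derived datasets under stated assumptions: SMOTE stands in for rebalancing, SelectKBest for feature selection, and the "comprehensive metric" is approximated by the mean cross-validated accuracy of several standard classifiers. The grids `ratios` and `feature_counts` are illustrative and do not reproduce the paper's 32 derived datasets.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def select_best_derived_dataset(X, y, ratios=(0.25, 0.5, 0.75, 1.0),
                                feature_counts=(5, 10, 15, 20)):
    """Grid over (balance ratio, number of selected features); each derived
    dataset is scored by the mean CV accuracy of several classifiers, used
    here as a stand-in for the paper's comprehensive metric."""
    scorers = [LogisticRegression(max_iter=1000),
               DecisionTreeClassifier(),
               RandomForestClassifier(n_estimators=100)]
    best_score, best_config = -np.inf, None

    for ratio in ratios:
        try:
            X_bal, y_bal = SMOTE(sampling_strategy=ratio).fit_resample(X, y)
        except ValueError:
            continue  # ratio below the dataset's existing minority/majority ratio
        for k in feature_counts:
            if k > X_bal.shape[1]:
                continue
            X_sel = SelectKBest(f_classif, k=k).fit_transform(X_bal, y_bal)
            # comprehensive score: average CV accuracy over all classifiers
            score = np.mean([cross_val_score(clf, X_sel, y_bal, cv=5).mean()
                             for clf in scorers])
            if score > best_score:
                best_score, best_config = score, (ratio, k, X_sel, y_bal)

    return best_score, best_config
```

The selected derived dataset (the returned `X_sel`, `y_bal`) would then be used to train the downstream CNN described in the abstract.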