2020
DOI: 10.3390/app10155164

A New Under-Sampling Method to Face Class Overlap and Imbalance

Abstract: Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in Pattern Recognition and Data Mining as they may cause a significant loss in performance. Several solutions have been proposed to face both data difficulties, but most of these approaches tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that combines the DBSCAN clustering algorithm to remove noisy samples and clean the decision boundary with a minimum sp…

Cited by 31 publications (16 citation statements)
References 56 publications
“…In general, most real-world data include various types of noise that can degrade learning performance. In imbalanced classification problems in particular, it is known that the decision boundary becomes clearer once noisy samples in overlapped regions are identified and eliminated (Fotouhi, Asadi, and Kattan 2019; Guzmán-Ponce et al. 2020). Thus, several useful methods have been developed to identify and eliminate noisy samples in imbalanced classification problems, especially those close to the decision boundary.…”
Section: Anomaly Detection Methods
confidence: 99%
“…Karami and Johansson (2014) provided an efficient hybrid clustering method called BDE-DBSCAN, which combines the binary differential evolution (BDE) method with the DBSCAN algorithm to determine appropriate values of the parameters ε and MinPts quickly and automatically. Guzmán-Ponce et al. (2020) proposed an under-sampling method called DBMIST-US that combines DBSCAN with a minimum spanning tree (MST) algorithm, first identifying noisy samples and then cleaning borderline samples (i.e., the samples close to the decision boundary).…”
Section: Clustering Methods
confidence: 99%
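The noise-identification stage that the quoted methods build on can be illustrated with a minimal pure-Python DBSCAN. This is an illustrative sketch on toy 2-D points, not the paper's implementation; the dataset and the `dbscan` helper are assumptions, while `eps` and `min_pts` follow the standard DBSCAN parameterization.

```python
# Minimal DBSCAN sketch in pure Python: points in sparse regions are
# flagged as noise (label -1), which is the first stage of a
# DBSCAN-based cleaning step like the one described above.
from math import dist

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        cluster += 1                  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:
                queue.extend(j_seeds)  # expand only through core points
    return labels

# Two dense groups plus one isolated point that should be flagged as noise.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels)  # → [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

In an under-sampling pipeline of the kind cited, the points labeled -1 would be discarded before the boundary-cleaning stage.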
“…It randomly deletes samples from the majority class until it reaches the same number as the minority class. Guzmán-Ponce et al. [30] proposed a two-stage under-sampling method, which combined the DBSCAN [31] clustering algorithm with a minimum spanning tree algorithm to handle class overlap and imbalance simultaneously. Koziarski [32] proposed a method named CSMOUTE, which performs synthetic under-sampling by combining the two nearest majority instances.…”
Section: The Review Of Data Sampling
confidence: 99%
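The random under-sampling baseline mentioned above can be sketched in a few lines. The toy data, the `random_under_sample` helper, and the fixed seed are illustrative assumptions, not part of any cited method.

```python
# Random under-sampling sketch: keep all minority samples and a random
# subset of the majority class of the same size, so classes end up balanced.
import random

def random_under_sample(X, y, majority_label, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    maj = [i for i, c in enumerate(y) if c == majority_label]
    mino = [i for i, c in enumerate(y) if c != majority_label]
    keep = sorted(rng.sample(maj, len(mino)) + mino)
    return [X[i] for i in keep], [y[i] for i in keep]

X = list(range(10))
y = [0] * 8 + [1] * 2            # 8 majority samples, 2 minority samples
Xb, yb = random_under_sample(X, y, majority_label=0)
print(yb.count(0), yb.count(1))  # → 2 2
```

Because the deleted majority samples are chosen blindly, informative boundary samples may be lost, which is the weakness the clustering-based methods above aim to avoid.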
“…Roy et al. [32] combined SMOTE-Tomek to balance the Pima diabetes dataset using an ANN and achieved an accuracy of 98%. Guzmán-Ponce et al. [11] proposed a two-stage under-sampling strategy that combines DBSCAN clustering, to eliminate noisy samples, with a minimum spanning tree (MST) algorithm that refines the decision boundary, in order to deal with class imbalance.…”
Section: Related Work
confidence: 99%
“…Most resampling methods rely on the k-nearest-neighbor (KNN) rule [7, 10], either by eliminating instances of both classes that are far from the decision boundary to reduce duplication, as in condensing, or by removing those close to the boundary for generalization, as in filtering [11]. Similarly, Tomek links are used to eliminate instances from the majority class since, if two examples form a Tomek link, either one of them is noise or both are borderline.…”
Section: Introduction
confidence: 99%
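The Tomek-link rule quoted above can be sketched directly: two samples form a Tomek link when they are each other's nearest neighbor yet carry different class labels, and the majority-class member of each link is then dropped to clean the boundary. The toy 1-D data and helper names here are assumptions for illustration, not the cited implementation.

```python
# Tomek-link detection and majority-side removal on toy 1-D data.
from math import dist

def nearest(i, X):
    """Index of the nearest neighbor of sample i (excluding itself)."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbors of different classes."""
    links = []
    for i in range(len(X)):
        j = nearest(i, X)
        if y[i] != y[j] and nearest(j, X) == i and i < j:
            links.append((i, j))
    return links

# Majority class 0 overlaps minority class 1 near x = 2.
X = [(0.0,), (0.5,), (2.0,), (4.0,), (2.1,), (5.0,)]
y = [0, 0, 0, 1, 1, 1]
links = tomek_links(X, y)        # → [(2, 4)]
# Drop the majority (class 0) side of each link to clean the boundary.
drop = {i if y[i] == 0 else j for i, j in links}
X_clean = [x for k, x in enumerate(X) if k not in drop]
print(links, drop)
```

Only the overlapping majority sample at x = 2.0 is removed; samples far from the boundary are untouched, matching the filtering behavior described in the quote.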