2021
DOI: 10.32604/cmc.2021.012547
Dealing with Imbalanced Dataset Leveraging Boundary Samples Discovered by Support Vector Data Description

Abstract: These days, imbalanced datasets, denoted throughout the paper by ID (datasets containing some, usually two, classes where one class has a considerably smaller number of samples than the other(s)), emerge in many real-world problems (such as health-care systems, disease-diagnosis systems, anomaly detection, fraud detection, stream-based malware-detection systems, and so on). These datasets cause problems such as under-training of the minority class(es), over-training of the majority class(es), and bias towards maj…

Cited by 11 publications (4 citation statements)
References 63 publications (67 reference statements)
“…When the dataset is unbalanced and the MSE function is used as the loss function, it makes the machine learning model more inclined to predict classes with large sample numbers [41]. Data sampling is commonly used in many studies to address data imbalance, but such methods can have an unexpected impact on the data: undersampling can result in missing data, oversampling is blind in generating the data, and data sampling, in general, can easily marginalize data [42][43][44]. In this study, we attempted to use a focal loss (FL) [45] to address the data imbalance problem.…”
Section: Applying the CMA-ES To Train The DNM
confidence: 99%
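The focal loss mentioned in the statement above down-weights well-classified examples so that training gradients concentrate on hard, often minority-class, samples. A minimal sketch of the binary form (the `focal_loss` name, `gamma`, and `alpha` defaults are illustrative, following the commonly cited formulation FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t)):

```python
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: easy examples (p_t near 1) get a small
    (1 - p_t)^gamma modulating factor, so their contribution shrinks."""
    p_pred = np.clip(p_pred, eps, 1 - eps)          # avoid log(0)
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)  # prob. of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

With `gamma = 0` and `alpha = 0.5` this reduces to (half of) the ordinary cross-entropy, so the loss can be tuned smoothly between the two.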
“…Unlike under-sampling method, oversampling method balances the class distribution by replicating minority class samples until it can reach the number of majority class samples. For example, these oversampling methods in Zhu et al (2021), Luo et al (2021), Amit and Chinmay (2021) and Li et al (2021) achieve ideal classification results, but they can only learn on specific decision regions of the replicated data, which limits the learning ability of a classifier.…”
Section: Related Work
confidence: 99%
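The replication-based oversampling described above can be sketched in a few lines: minority-class rows are drawn with replacement until the class counts match. The `random_oversample` helper is illustrative, not from the cited works; note how it can only re-emit existing points, which is exactly the limited-decision-region weakness the statement raises.

```python
import numpy as np

def random_oversample(X, y, rng=None):
    """Balance a binary dataset by replicating minority-class rows
    (sampled with replacement) until both classes have equal size."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_needed = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)           # minority-row indices
    extra = rng.choice(idx, size=n_needed, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
```

Methods like SMOTE instead interpolate between minority neighbours to synthesize new points rather than duplicating old ones.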
“…C-SMOTE [26], an over-sampling method based on clustering, clusters the positive and negative classes separately, which can solve not only the problem of imbalance between classes but also the problem of imbalance within classes. Another approach re-samples using the easily misjudged boundary samples found by Support Vector Data Description (SVDD) [27]; similar resampling methods include [28,29].…”
Section: Unbalanced Data Processing
confidence: 99%
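The SVDD-based idea above hinges on finding samples near the enclosing boundary, since those are the ones most easily misjudged. A minimal sketch, assuming scikit-learn is available and using `OneClassSVM` with an RBF kernel (which is closely related to SVDD for such kernels; the `boundary_samples` helper and its `nu` choice are illustrative, not the cited paper's method):

```python
import numpy as np
from sklearn.svm import OneClassSVM

def boundary_samples(X, nu=0.2, gamma="scale"):
    """Fit a one-class SVM and return the indices of its support
    vectors, i.e. the samples lying on or outside the learned
    boundary -- a proxy for SVDD's easily misjudged boundary points."""
    model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X)
    return model.support_
```

The returned indices could then feed a resampling step that replicates or synthesizes points around the class boundary instead of uniformly over the minority class.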