2021
DOI: 10.1109/access.2021.3071389
Deep Generative Models to Counter Class Imbalance: A Model-Metric Mapping With Proportion Calibration Methodology

Abstract: The most pervasive segment of techniques for managing class imbalance in machine learning is re-sampling-based methods. The emergence of deep generative models for augmenting the size of the under-represented class prompts one to revisit the question of how suitable the model chosen for data augmentation is with respect to the metric selected for the goodness of classification. This work defines this suitability by using newly-sampled data points from each generative model first to the degree of parity, and studying c…

Cited by 13 publications (8 citation statements)
References 50 publications
“…The core idea behind generative modeling in the context of tackling the class imbalance problem is to estimate the probability density function describing the data and generate new data instances in a random fashion [39] in order to balance the data distribution in an otherwise imbalanced dataset. Generative models typically construct a latent space that aims to capture the direct cause of the target variable.…”
Section: Generative Modeling
confidence: 99%
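The estimate-the-density-then-sample idea described in this excerpt can be sketched with a kernel density estimate standing in for a deep generative model. This is a minimal illustration on synthetic toy data; the `bandwidth` value and dataset shapes are illustrative choices, not taken from the paper:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Toy imbalanced 2-D dataset: 100 majority points, 10 minority points.
X_majority = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X_minority = rng.normal(loc=3.0, scale=0.5, size=(10, 2))

# Estimate the minority-class probability density, then draw new
# instances from it until the two classes are the same size.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_minority)
n_needed = len(X_majority) - len(X_minority)
X_synthetic = kde.sample(n_samples=n_needed, random_state=0)

X_balanced = np.vstack([X_minority, X_synthetic])
print(X_balanced.shape)  # (100, 2)
```

A deep generative model plays the same role as the KDE here, but with a learned latent space rather than a fixed kernel.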
“…Where most research has focused on modifying GAN architectures to achieve optimal results in class-imbalanced settings, Mirza et al [79] posed a distinct yet equally important question: given a desired evaluation metric to optimize, which data augmentation method and what proportion of synthetic-sample injection should be used? The resulting framework, termed the Model-Metric Mapping methodology, or MMM, offers a procedural and hierarchical approach that guides the practitioner toward proper model selection based on the desired evaluation metric.…”
Section: Other Disciplines
confidence: 99%
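The proportion question this excerpt raises — how much synthetic data to inject for a chosen metric — can be sketched as a simple sweep. This is a toy stand-in, not the paper's MMM procedure: one KDE generator, one classifier, F1 as the single metric, and a hypothetical grid of injection proportions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity

# Imbalanced binary toy problem (~10% minority class).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Fit a generative model (here a KDE) to the minority class only.
X_min = X_tr[y_tr == 1]
kde = KernelDensity(bandwidth=0.5).fit(X_min)

best = (None, -1.0)
for prop in [0.0, 0.25, 0.5, 1.0]:  # fraction of minority size to inject
    n_new = int(prop * len(X_min))
    if n_new:
        X_aug = np.vstack([X_tr, kde.sample(n_new, random_state=0)])
        y_aug = np.concatenate([y_tr, np.ones(n_new)])
    else:
        X_aug, y_aug = X_tr, y_tr
    clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
    score = f1_score(y_te, clf.predict(X_te))
    if score > best[1]:
        best = (prop, score)

print("best injection proportion:", best[0])
```

MMM generalizes this pattern across multiple generative models and evaluation metrics; the sweep above shows only the calibration step for one model-metric pair.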
“…Borderline-SMOTE and ADASYN are both frequently cited as baseline methods, and consequently both [20] and [14] possess a high in-degree count. The use of conditional GANs for data generation based on class labels is addressed in [52,79], and [84] (all with five or six in-degrees). With five in-network citations, [85] offers advice on best practices for hyper-parameter tuning of GANs at the time of publication, though that work is mainly done with computer vision tasks in mind.…”
Section: Citation Network Analysis
confidence: 99%
“…The author identifies seven vital areas of research on this topic, covering the full spectrum of learning from imbalanced data: classification, regression, clustering, data streams, big data analytics, and applications. Fanny et al [11], Ming et al [12], Zhai et al [13], and Mirza et al [14] propose different deep learning approaches to address class imbalance. Fanny et al [11] proposed a method based on the Class Expert Generative Adversarial Network (CE-GAN). In this approach, a GAN is trained for each minority class, with the generator network conditioned on the class label.…”
Section: Introduction
confidence: 99%
“…This approach improves the performance of minority classes by increasing the diversity of the training data. Mirza et al [14] proposed deep generative models to counter class imbalance, taking two approaches: the first uses a variational autoencoder to generate synthetic samples for the minority class, while the second uses a generative adversarial network to the same end.…”
Section: Introduction
confidence: 99%
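The variational-autoencoder route described in this excerpt can be illustrated with a minimal sketch, assuming PyTorch and toy data; the architecture, data, and training budget here are hypothetical and do not reproduce the paper's models:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy minority-class data: 64 points in 4-D (a stand-in for the
# under-represented class; the paper's actual datasets differ).
X_min = torch.randn(64, 4) * 0.5 + 2.0

class VAE(nn.Module):
    def __init__(self, d_in=4, d_lat=2):
        super().__init__()
        self.enc = nn.Linear(d_in, 8)
        self.mu = nn.Linear(8, d_lat)
        self.logvar = nn.Linear(8, d_lat)
        self.dec = nn.Sequential(nn.Linear(d_lat, 8), nn.ReLU(), nn.Linear(8, d_in))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

vae = VAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-2)
for _ in range(200):
    recon, mu, logvar = vae(X_min)
    # Reconstruction term plus KL divergence to the latent prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = nn.functional.mse_loss(recon, X_min) + kl
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generate synthetic minority samples by decoding draws from the prior.
with torch.no_grad():
    X_syn = vae.dec(torch.randn(100, 2))
print(tuple(X_syn.shape))  # (100, 4)
```

The GAN variant mentioned in the excerpt follows the same inject-synthetic-samples pattern, with the decoder replaced by a generator trained adversarially.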