2021
DOI: 10.1016/j.eswa.2021.114582

Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning

Cited by 146 publications (75 citation statements)
References 19 publications
“…The imbalanced classification performance of our approach is compared against four commonly used resampling methods, i.e., CCR [5], kSMOTE [3], GAN [17] and CUSBoost [18]. The former three are oversampling methods, while the last one is undersampling-based.…”
Section: Experimental Set-ups (mentioning)
confidence: 99%
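To make the oversampling/undersampling distinction in this statement concrete, here is a minimal Python sketch using imbalanced-learn's SMOTE and RandomUnderSampler as generic stand-ins on a synthetic dataset; the CCR, kSMOTE, GAN-based and CUSBoost resamplers compared in the cited work are not shown here.

```python
# Minimal sketch of over- vs. undersampling on an imbalanced tabular dataset.
# SMOTE and RandomUnderSampler stand in for the resamplers compared in the
# cited experiments (CCR, kSMOTE and CUSBoost are not part of imbalanced-learn).
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic binary problem with roughly a 9:1 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# Oversampling: synthesize minority samples until the classes are balanced.
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print("oversampled:", Counter(y_over))

# Undersampling: discard majority samples until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_under))
```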
“…To this end, we propose a technique to generate realistic botnet data using Generative Adversarial Networks (GANs) to improve classifiers' decision making when detecting potential evasion samples. GANs have proved highly effective in several recent research works [6]–[10]. A GAN is a combination of two different AI models that learn competitively to generate realistic samples.…”
Section: Introduction (mentioning)
confidence: 99%
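Since this statement summarizes a GAN as two models trained against each other, the following is a minimal PyTorch sketch of that generator/discriminator loop for tabular feature vectors; the layer sizes, dimensions and hyperparameters are illustrative assumptions, not those of any cited work.

```python
# Minimal GAN sketch for tabular feature vectors: a generator maps Gaussian
# noise to synthetic rows, a discriminator scores rows as real or fake, and
# the two networks are trained adversarially.
import torch
import torch.nn as nn

NOISE_DIM, DATA_DIM = 32, 10  # illustrative sizes

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 64), nn.ReLU(),
    nn.Linear(64, DATA_DIM),
)
discriminator = nn.Sequential(
    nn.Linear(DATA_DIM, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch: torch.Tensor) -> None:
    batch = real_batch.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator step: distinguish real rows from generated ones.
    fake_batch = generator(torch.randn(batch, NOISE_DIM)).detach()
    d_loss = (bce(discriminator(real_batch), real_labels)
              + bce(discriminator(fake_batch), fake_labels))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: produce rows the discriminator labels as real.
    fake_batch = generator(torch.randn(batch, NOISE_DIM))
    g_loss = bce(discriminator(fake_batch), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# Example usage with a random stand-in batch of "real" tabular rows.
train_step(torch.randn(64, DATA_DIM))
```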
“…In such scenarios, machine learning classifiers over-fit the majority class and fail to generalize to the test set [12]. The motivation for using GANs for data oversampling is their effectiveness in mimicking complex probability distributions [10]. To address the low-data-regime problem, synthetic oversampling techniques like SMOTE [13] are employed, but these techniques rely on nearest-neighbour search and linear interpolation, which makes them unsuitable for high-dimensional data with complex probability distributions [10].…”
Section: Introduction (mentioning)
confidence: 99%
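To make the nearest-neighbour and linear-interpolation dependence mentioned in this statement concrete, here is a rough Python sketch of the core SMOTE step using scikit-learn's NearestNeighbors; it is a simplification of the actual SMOTE algorithm [13], with illustrative parameter values.

```python
# Sketch of the core SMOTE step: pick a minority sample, pick one of its
# minority-class neighbours, and interpolate linearly between the two points
# to create a synthetic sample.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_minority: np.ndarray, n_new: int, k: int = 5,
               seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbour_idx = nn.kneighbors(X_minority)  # column 0 is the point itself

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        j = rng.choice(neighbour_idx[i][1:])   # a random minority neighbour
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.asarray(synthetic)

# Example: 100 two-dimensional minority samples, 50 synthetic ones.
X_min = np.random.default_rng(1).normal(size=(100, 2))
print(smote_like(X_min, n_new=50).shape)  # (50, 2)
```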
“…The basic idea of a GAN is to use a generator to produce the required samples from random points drawn from a specific distribution (for example, a Gaussian distribution). Some scholars exploit the ability of GANs to learn image distributions and apply it to image anomaly detection, for example AnoGAN [36], BiGAN [37] and GANomaly [38], as well as to GAN-based intrusion detection models for imbalanced data [61]–[63]. These GAN-based network architectures have shown high performance.…”
Section: Introduction (mentioning)
confidence: 99%
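This statement describes the generator as mapping random points from a chosen distribution to samples; in the class-conditional setting that the indexed paper's title refers to, the generator is additionally conditioned on a class label so that minority-class rows can be sampled on demand. Below is a hedged sketch of that conditioning idea, concatenating the noise vector with a one-hot label; the sizes and architecture are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a class-conditional generator: the noise vector is concatenated
# with a one-hot class label, so sampling with the minority label yields
# synthetic minority-class rows (illustrative sizes only).
import torch
import torch.nn as nn
import torch.nn.functional as F

NOISE_DIM, N_CLASSES, DATA_DIM = 32, 2, 10

cond_generator = nn.Sequential(
    nn.Linear(NOISE_DIM + N_CLASSES, 64), nn.ReLU(),
    nn.Linear(64, DATA_DIM),
)

def sample(n: int, class_idx: int) -> torch.Tensor:
    """Generate n synthetic rows conditioned on the given class index."""
    noise = torch.randn(n, NOISE_DIM)
    labels = F.one_hot(torch.full((n,), class_idx, dtype=torch.long),
                       num_classes=N_CLASSES).float()
    return cond_generator(torch.cat([noise, labels], dim=1))

# Example: request 5 rows conditioned on the minority class (index 1).
print(sample(5, class_idx=1).shape)  # torch.Size([5, 10])
```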