2020
DOI: 10.48550/arxiv.2008.09202
Preprint

Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning

Cited by 11 publications (16 citation statements)
References 0 publications
“…Similar to (Fiore et al 2019), the author synthesises the underrepresented class hence making known the class label. Other work has shown the efficacy of generative networks above traditional methods (Liu et al 2019;Ngwenduna and Mbuvha 2021;Engelmann and Lessmann 2020). However, we do not find any studies reporting the similarity of the synthesisers to the original dataset.…”
Section: Introduction (contrasting)
confidence: 58%
“…However, undersampling might result in the loss of diversity. For oversampling, methods like SMOTE use nearest neighbours and linear interpolation, which can be unsuitable for high-dimensional and complex probability distributions [8], [21]. Recent research works proposed algorithms for data oversampling.…”
Section: B. Data Oversampling and GANs (mentioning)
confidence: 99%
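The SMOTE mechanism referenced in the statement above, drawing a new point by linear interpolation between a minority sample and one of its k nearest neighbours, can be sketched as follows. This is a didactic reimplementation using numpy and scikit-learn's NearestNeighbors, not the imbalanced-learn API; the function name, k=5, and the seed are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Generate synthetic minority samples by linear interpolation
    between a minority point and one of its k nearest neighbours
    (the classic SMOTE idea; a sketch, not imbalanced-learn's API)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # pick a minority sample
        j = idx[i, rng.integers(1, k + 1)]  # pick one of its k neighbours
        lam = rng.random()                  # interpolation factor in [0, 1]
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(new)

# Example: 20 minority points in 2-D, create 40 synthetic ones
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_syn = smote_like_oversample(X_min, n_new=40)
print(X_syn.shape)  # (40, 2)
```

The interpolated points lie on line segments between existing minority samples, which is exactly why the statement above questions its suitability for high-dimensional, complex distributions.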
“…Using generative adversarial networks (GANs) as synthetic oversamplers has been a voguish research endeavour for low data regimes [3], [7]. Various researchers have demonstrated that GANs are more effective as compared to other synthetic oversamplers like SMOTE [2], [6], [8], [9]. It is found in many studies that due to the adversarial factor, GANs can better estimate the target probability distribution [2], [8], [10].…”
Section: Introduction (mentioning)
confidence: 99%
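To make concrete how a trained generator serves as a synthetic oversampler, here is a minimal PyTorch sketch: a placeholder conditional generator produces synthetic minority-class rows that could be appended to the training set. The architecture, dimensions, and function names are illustrative assumptions, not the method of the indexed paper.

```python
import torch

# Hypothetical conditional generator: maps (noise, class label) -> tabular row.
# In practice this would be trained with whichever GAN variant is chosen
# (e.g. a conditional WGAN); here it is only a placeholder module.
class Generator(torch.nn.Module):
    def __init__(self, noise_dim=32, n_classes=2, n_features=10):
        super().__init__()
        self.embed = torch.nn.Embedding(n_classes, 8)
        self.net = torch.nn.Sequential(
            torch.nn.Linear(noise_dim + 8, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, n_features),
        )

    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))

def oversample_minority(G, n_needed, minority_label, noise_dim=32):
    """Draw synthetic minority-class rows from the trained generator."""
    z = torch.randn(n_needed, noise_dim)
    y = torch.full((n_needed,), minority_label, dtype=torch.long)
    with torch.no_grad():
        return G(z, y)

G = Generator()  # in practice: load trained weights
X_syn = oversample_minority(G, n_needed=500, minority_label=1)
print(X_syn.shape)  # torch.Size([500, 10])
```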
“…GAN's framework corresponds to a minimax two-player game, it simultaneously trains two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. GAN is becoming more and more popular in the field of content generation [45,46]. In the field of credit scoring, GAN has been used to solve the sample imbalance problem [47].…”
Section: Adversarial Validation (mentioning)
confidence: 99%
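The minimax two-player game described in this statement can be illustrated as a single alternating training step: the discriminator D is updated to separate real rows from generated ones, then the generator G is updated to fool D. This is a generic, non-saturating GAN step in PyTorch with assumed network sizes, not the conditional Wasserstein variant proposed in the indexed paper.

```python
import torch

noise_dim, n_features = 32, 10
G = torch.nn.Sequential(torch.nn.Linear(noise_dim, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, n_features))
D = torch.nn.Sequential(torch.nn.Linear(n_features, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = torch.nn.BCEWithLogitsLoss()

def train_step(x_real):
    batch = x_real.size(0)
    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    z = torch.randn(batch, noise_dim)
    x_fake = G(z).detach()
    loss_d = bce(D(x_real), torch.ones(batch, 1)) + \
             bce(D(x_fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator step (non-saturating form): push D(G(z)) toward 1.
    z = torch.randn(batch, noise_dim)
    loss_g = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

print(train_step(torch.randn(64, n_features)))
```

A Wasserstein formulation, as in the indexed paper, would replace the binary cross-entropy losses with a critic score and a Lipschitz constraint, but the alternating G/D structure shown here is the same.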