2015 · DOI: 10.12988/ams.2015.58562
SMOTE bagging algorithm for imbalanced dataset in logistic regression analysis (case: credit of bank X)

Abstract: Logistic regression analysis is one of the most popular and commonly used classification methods. This classifier works well when the class distribution of the response variable is balanced. In many real cases, however, imbalanced class datasets are frequently encountered. This problem makes it difficult to obtain a good predictive model for the minority class: the resulting prediction accuracy is good for the majority class but not for the minority class. SMOTEBagging is a combination of SMOTE and Bagging…
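For concreteness, here is a minimal sketch of the SMOTEBagging idea with logistic regression, assuming the scikit-learn and imbalanced-learn packages and binary labels coded 0/1; the helper names and the n_estimators default are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical SMOTEBagging sketch (not the paper's exact code):
# each bagging round draws a bootstrap sample, rebalances it with
# SMOTE, and fits a logistic regression; the ensemble predicts by
# majority vote. Assumes binary labels 0/1 and that every bootstrap
# keeps at least k_neighbors + 1 minority samples for SMOTE.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

def smote_bagging_fit(X, y, n_estimators=10, random_state=0):
    rng = np.random.RandomState(random_state)
    models = []
    for _ in range(n_estimators):
        # Bagging step: bootstrap resample with replacement.
        Xb, yb = resample(X, y, random_state=rng.randint(10**6))
        # SMOTE step: oversample the minority class to parity.
        Xs, ys = SMOTE(random_state=rng.randint(10**6)).fit_resample(Xb, yb)
        models.append(LogisticRegression(max_iter=1000).fit(Xs, ys))
    return models

def smote_bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])  # one row of 0/1 votes per model
    return (votes.mean(axis=0) >= 0.5).astype(int)    # majority vote
```

Each ensemble member sees a differently rebalanced bootstrap sample, which is what lets the ensemble recover minority-class accuracy that a single logistic regression fit on the raw data tends to lose.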

Cited by 31 publications (10 citation statements) · References 10 publications
“…A thing to take note of when using supervised methods for training is imbalanced data: predictive models developed with conventional machine learning algorithms can be biased and inaccurate when the number of observations in one class of the dataset is significantly lower than in the other. To handle imbalanced data, several methods can be used, including resampling, boosting, and bagging [17][18][19][20].…”
Section: Supervised Model
confidence: 99%
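To make the resampling option in this excerpt concrete, here is a small hedged sketch using imbalanced-learn's random over- and undersamplers on a synthetic toy dataset; the class weights and printed counts are illustrative assumptions, not data from the cited works.

```python
# Random over- and undersampling on a synthetic imbalanced dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=0)
print(Counter(y))                        # heavily imbalanced, e.g. ~950 vs ~50

X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_o))                      # minority duplicated up to parity

X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_u))                      # majority downsampled to parity
```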
“…In 2014, a method based on cost-sensitive decision trees with feature-space partitioning was introduced (Krawczyk et al, 2014), and the results were computed on benchmark datasets with varying imbalance ratios (IR). An analysis of SMOTEBagging with logistic regression on credit scoring data showed higher accuracy than plain logistic regression (Hanifah et al, 2015). RHSBoost, a new ensemble classification method that combines random undersampling and ROSE sampling under a boosting scheme, was proposed to address the imbalanced classification problem (Gong & Kim, 2017).…”
Section: Related Work
confidence: 99%
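The cost-sensitive idea mentioned in this excerpt can be illustrated generically with scikit-learn's class_weight option; this is a plain cost-sensitive decision tree, not Krawczyk et al.'s feature-space-partitioning method, and the 10:1 cost ratio is an arbitrary assumption.

```python
# Generic cost-sensitive decision tree: errors on the minority class
# (label 1) are weighted 10x heavier than errors on the majority class.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

tree = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=0)
tree.fit(X_tr, y_tr)
print(classification_report(y_te, tree.predict(X_te)))
```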
“…To overcome such limitations, several algorithms have been proposed as extensions of SMOTE. Some focus on improving the generation of synthetic data by combining SMOTE with other techniques, including Tomek-links (Elhassan et al 2016), particle swarm optimization (Gao et al 2011; Wang et al 2014), rough set theory (Ramentol et al 2012), kernel-based approaches (Mathew et al 2015), Boosting (Chawla et al 2003), and Bagging (Hanifah et al 2015). Other approaches choose subsets of the minority class data from which to generate SMOTE samples, or limit the number of synthetic data points generated (Santoso et al 2017).…”
Section: Introduction
confidence: 99%
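Of the SMOTE extensions this excerpt lists, the combination of SMOTE with Tomek-links is available off the shelf in imbalanced-learn as SMOTETomek; a minimal sketch on a synthetic toy dataset (the dataset itself is an illustrative assumption, not data from the cited works):

```python
# SMOTE oversampling followed by Tomek-link cleaning.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # classes roughly balanced afterwards
```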