2019
DOI: 10.2991/ijcis.d.191114.002

A-SMOTE: A New Preprocessing Approach for Highly Imbalanced Datasets by Improving SMOTE

Abstract: Imbalanced learning is a challenging task for most standard machine learning algorithms. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known preprocessing approach for handling imbalanced datasets, in which the minority class is oversampled by producing synthetic examples in feature space rather than in data space. However, many recent works have shown that the imbalance ratio in itself is not a problem, and that deterioration of model performance is caused by other reasons linked to the minority cl…
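To make the feature-space interpolation mentioned in the abstract concrete, here is a minimal SMOTE-style sketch (an illustrative reimplementation, not the paper's A-SMOTE; the function name and parameters are assumptions):

```python
# Minimal SMOTE-style oversampling sketch: synthetic minority samples are
# created by interpolating between a minority point and one of its k
# nearest minority neighbors in feature space.
import numpy as np

def smote_sample(X_min, k=5, n_new=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise distances among minority samples only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self as a neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]  # k nearest minority neighbors
    synth = []
    for _ in range(n_new):
        i = rng.integers(n)                   # pick a random minority point
        j = neighbors[i, rng.integers(k)]     # pick one of its neighbors
        gap = rng.random()                    # interpolation factor in [0, 1]
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synth)
```

In practice one would typically use an established implementation such as imbalanced-learn's SMOTE rather than a hand-rolled sketch like this.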

Cited by 42 publications (25 citation statements). References 46 publications.
“…Usually, in order to avoid eliminating a significant number of majority instances, oversampling algorithms are preferred, and the Synthetic Minority Oversampling Technique (SMOTE) algorithm proposed by Chawla et al. (2002) is the most widely used. Subsequently, more than 85 variants of SMOTE have been reported in the literature to further improve its basic form in terms of different classification metrics (Fernández et al. 2018), such as borderline-SMOTE1 and borderline-SMOTE2, advanced SMOTE (A-SMOTE), and a distributed version of SMOTE (Han et al. 2005; Hooda and Mann 2019; Hussein 2019). There seem to be only a few literature reports dealing with a detailed critical comparison of these proposed methods (Bajer et al. 2019; Kovács 2019).…”
Section: Background Literature
confidence: 99%
“…The sum of these values is 40 and it is divided by the number of rows in the table, which in this case is 18. The result is 2.222 and, as this is the lowest average, it occupies position 1 in the ranking [30].…”
Section: Results and Discussion
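As a quick check of the arithmetic in the quoted passage (a trivial sketch; the values come directly from the quote):

```python
# Reproducing the average-rank arithmetic quoted above.
values_sum = 40           # sum of the values in the table
n_rows = 18               # number of rows in the table
average = values_sum / n_rows
print(round(average, 3))  # 2.222 -- the lowest average, hence rank 1
```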
“…Synthetic instances that are far from the borderline are easier to categorize than those that are close to the borderline, which present a significant learning difficulty for the majority of classifiers. The authors in [32] describe an advanced strategy (A-SMOTE) for preprocessing imbalanced training sets based on these findings. It aims to clearly characterize the borderline and generate pure synthetic samples from the SMOTE generalization.…”
Section: Proposed Methodology
confidence: 99%
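To illustrate the borderline-filtering idea described above, here is a minimal sketch (under stated assumptions, not the authors' exact A-SMOTE procedure; all names are hypothetical): synthetic samples whose nearest original neighbors are mostly majority-class points are treated as crossing the borderline and discarded.

```python
import numpy as np

def filter_borderline_synthetics(synth, X_maj, X_min, k=5):
    """Keep a synthetic sample only if most of its k nearest original
    neighbors are minority-class points, discarding synthetics that
    drift across the borderline into majority territory."""
    X = np.vstack([X_maj, X_min])
    labels = np.array([0] * len(X_maj) + [1] * len(X_min))  # 1 = minority
    kept = []
    for s in synth:
        d = np.linalg.norm(X - s, axis=1)  # distances to all original points
        nn = np.argsort(d)[:k]             # k nearest original neighbors
        if labels[nn].sum() > k // 2:      # mostly minority neighbors: keep
            kept.append(s)
    return np.asarray(kept)
```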
“…AdaBoost makes it possible to merge various “weak classifiers” into a single classifier called a “strong classifier.” Decision trees with one level, i.e., with only one split, are the weak learners most commonly used with AdaBoost; such trees are also known as decision stumps [32]. This approach begins by assigning equal weights to all of the data points.…”
Section: Exploratory Knowledge
confidence: 99%
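A short illustrative example of AdaBoost with decision stumps as weak learners, using scikit-learn (assuming scikit-learn >= 1.2, where the weak learner is passed via the estimator parameter; the dataset and parameter values are arbitrary):

```python
# AdaBoost with depth-1 decision trees ("decision stumps") as weak learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# A small synthetic dataset with a 90/10 class imbalance.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Each stump is a "weak classifier"; boosting reweights the data so that
# later stumps focus on previously misclassified points, and the weighted
# vote of all stumps forms the "strong classifier".
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a decision stump
    n_estimators=50,
)
clf.fit(X, y)
print(clf.score(X, y))
```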