SCUT: Multi-Class Imbalanced Data Classification using SMOTE and Cluster-based Undersampling

Agrawal, Ankita; Viktor, Herna L.; Paquet, Éric

doi:10.5220/0005595502260234

Cited by 74 publications

(43 citation statements)

References 11 publications


“…Second, we proposed an integrated resampling approach, by using the SMOTE for over-sampling and PSO for under-sampling. Although some hybrid resampling approaches were presented in previous studies (Agrawal et al 2015; Huda et al 2018), integrating SMOTE-based over-sampling and PSO-based under-sampling has not been explored before. Third, by compiling real-world datasets with different imbalance ratios and testing them with eight machine learning methods, the proposed integrated resampling approach was comprehensively evaluated.…”

Section: Conclusion, Implications and Future Research Directions (mentioning)

confidence: 99%

Malicious web domain identification using online credibility and performance data by considering the class imbalance issue

Chiong

Pranata

et al. 2019

IMDS


Purpose: Malicious web domain identification is of significant importance to the security protection of Internet users. With online credibility and performance data, this paper aims to investigate the use of machine learning techniques for malicious web domain identification by considering the class imbalance issue (i.e., there are more benign web domains than malicious ones). Design/methodology/approach: We propose an integrated resampling approach to handle class imbalance by combining the Synthetic Minority Over-sampling TEchnique (SMOTE) and Particle Swarm Optimisation (PSO), a population-based meta-heuristic algorithm. We use the SMOTE for over-sampling and PSO for under-sampling. Findings: By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain datasets with different imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm that the proposed approach is highly effective. Practical implications: This study not only inspires the practical use of online credibility and performance data for identifying malicious web domains, but also provides an effective resampling approach for handling the class imbalance issue in the area of malicious web domain identification. Originality/value: Online credibility and performance data is applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world datasets with different imbalance ratios.


“…The number of the method's iterations is set as the number of classes. There is also a limited number of works on combinations of oversampling with undersampling (Agrawal et al, 2015), which include a selective hybrid resampling SPIDER3 (Wojciechowski et al, 2017), where relations between classes are captured by predefined misclassification costs. Moreover, Sáez et al (2016) have applied types of minority examples of Napierala and Stefanowski (2012) to independently oversample single minority classes, however without considering any relations between classes.…”

Section: Related Work On Multiclass Imbalances (mentioning)

confidence: 99%

“…In SOUP, all majority classes are undersampled and all minority classes are oversampled to the cardinality being the average of the sizes of the biggest minority and the smallest majority class (line 3). It is partly inspired by experiences with SCUT undersampling (Agrawal et al, 2015). This provides us not only a dataset with a balanced class distribution, but also with a reasonable size.…”

Section: Resampling Algorithm SOUP (mentioning)

confidence: 99%
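
The target class size described in the SOUP statement above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `soup_target_size`, the explicit `minority_classes` argument (the statement does not say how classes are labeled minority or majority), and the floor rounding of the average are all hypothetical choices.

```python
from collections import Counter


def soup_target_size(labels, minority_classes):
    """Return the per-class size every class is resampled to.

    Follows the quoted description: the average of the size of the
    biggest minority class and the size of the smallest majority class.
    Which classes count as minority is supplied by the caller (assumption).
    """
    counts = Counter(labels)
    minority = set(minority_classes)

    # Size of the largest class among those designated as minority.
    biggest_minority = max(n for c, n in counts.items() if c in minority)
    # Size of the smallest class among the remaining (majority) classes.
    smallest_majority = min(n for c, n in counts.items() if c not in minority)

    # Floor division is an assumed rounding rule; the statement gives none.
    return (biggest_minority + smallest_majority) // 2


# Hypothetical four-class dataset: two majority and two minority classes.
labels = ["a"] * 100 + ["b"] * 60 + ["c"] * 20 + ["d"] * 10
target = soup_target_size(labels, minority_classes={"c", "d"})
print(target)
```

With this target, classes "a" and "b" would be undersampled and classes "c" and "d" oversampled to the same size, which yields a balanced dataset of moderate total size.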

Using Information on Class Interrelations to Improve Classification of Multiclass Imbalanced Data: A New Resampling Algorithm

Janicka

Lango

Stefanowski

2019

International Journal of Applied Mathematics and Computer Science


The relations between multiple imbalanced classes can be handled with a specialized approach which evaluates types of examples’ difficulty based on an analysis of the class distribution in the examples’ neighborhood, additionally exploiting information about the similarity of neighboring classes. In this paper, we demonstrate that such an approach can be implemented as a data preprocessing technique and that it can improve the performance of various classifiers on multiclass imbalanced datasets. It has led us to the introduction of a new resampling algorithm, called Similarity Oversampling and Undersampling Preprocessing (SOUP), which resamples examples according to their difficulty. Its experimental evaluation on real and artificial datasets has shown that it is competitive with the most popular decomposition ensembles and better than specialized preprocessing techniques for multi-imbalanced problems.


“…The class imbalance problem typically arises when one class in a dataset is heavily outnumbered by another [1] [2]. It occurs in data from many application domains, such as oil spill detection [4], remote sensing [5], text classification [6], response modeling [7], sensor data quality assessment [8], credit card fraud detection [9] and knowledge extraction from databases [10], which makes it an important topic for data mining researchers [11]. The problem is difficult, however, because traditional classification algorithms are biased against the minority class [12], meaning that, if they are applied regardless, their predictions can be misleading or outright wrong [13].…”

Section: Introduction (unclassified)

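Komparasi Algoritma Klasifikasi dengan Pendekatan Level Data untuk Menangani Data Kelas Tidak Seimbang [Comparison of Classification Algorithms with a Data-Level Approach for Handling Class-Imbalanced Data]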

Ilham

2018

Preprint


Real-world data from many sources now very often contains imbalanced classes. Class imbalance can degrade the predictive accuracy of classification methods. To address this problem, many previous studies have applied classification algorithms to imbalanced data. This study presents under-sampling and over-sampling techniques for handling class-imbalanced data. The techniques are applied at the preprocessing stage to balance the class distribution of the data. Experimental results show that a neural network (NN) outperforms a decision tree (DT), linear regression (LR), naïve Bayes (NB) and a support vector machine (SVM).

