2020
DOI: 10.3390/app10030794
Synthetic Minority Oversampling Technique for Optimizing Classification Tasks in Botnet and Intrusion-Detection-System Datasets

Abstract: Security is presently a hot research topic because of its impact on the daily information infrastructure. Machine-learning solutions have been improving classical detection practices, but detection tasks must deal with irregular amounts of data, since the number of instances that represent one or several malicious samples can vary significantly. On highly unbalanced data, classification models regularly achieve high precision with respect to the majority class, while minority classes are treated as noise due to the lack of infor…
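As a concrete illustration of the oversampling idea described in the abstract, the following is a minimal sketch using the imbalanced-learn implementation of SMOTE on a toy unbalanced dataset; the dataset and parameter values are assumptions for demonstration, not the configuration used in the paper.

```python
# Minimal SMOTE sketch (imbalanced-learn); toy data and parameters are
# illustrative only, not the paper's actual setup.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy, highly unbalanced binary dataset (roughly 5% minority class).
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    weights=[0.95, 0.05],
    random_state=42,
)
print("before:", Counter(y))

# SMOTE interpolates new minority samples between existing minority
# neighbours instead of simply duplicating them.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("after: ", Counter(y_res))
```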

Cited by 58 publications (33 citation statements) | References 53 publications
“…The first group of eight characteristics was chosen following the methodology introduced by Sharafaldin et al. [41], where a Random Forest Regressor was used to obtain the behavioral metrics (Subflow_Fwd_Byts, Subflow_Bwd_Byts, TotLen_Fwd_Pkts, TotLen_Bwd_Pkts, Fwd_Pkt_Len_Mean, Bwd_Pkt_Len_Mean, Fwd_Pkts/s and Bwd_Pkts/s) described in Table 1. The second group of nine features was added based on the methodology presented by Gonzalez-Cuautle et al. [16], where the ISOT HTTP Botnet Dataset was used (Src_Port, Dst_Port, Flow_Duration, Flow_Byts/s, Flow_Pkts/s, Tot_Fwd_Pkts, Tot_Bwd_Pkts, Subflow_Bwd_Pkts and Subflow_Fwd_Pkts). Two features were selected using the ‘feature_importances’ criterion to improve the precision of the Random Forest algorithm: Fwd Pkt Len Max and Fwd Pkt Len Min.…”
Section: Methods
confidence: 99%
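The quoted methodology ranks flow features with the Random Forest ‘feature_importances’ criterion and keeps the two highest-ranked ones. A minimal scikit-learn sketch of that selection step is given below; the placeholder feature names and toy data are assumptions, not the CIC-style flow features listed above.

```python
# Sketch of feature selection via RandomForestClassifier.feature_importances_;
# placeholder feature names, toy data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Rank features by impurity-based importance and keep, e.g., the top two
# (analogous to adding Fwd Pkt Len Max / Min in the quoted methodology).
ranking = np.argsort(forest.feature_importances_)[::-1]
top2 = [feature_names[i] for i in ranking[:2]]
print("selected features:", top2)
```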
“…Gonzalez-Cuautle et al. [16] proposed the Synthetic Minority Oversampling Technique (SMOTE) to address the difficulty of performing botnet classification on highly unbalanced datasets. The method was intended to improve the classification process with synthetically generated balanced data while optimally calibrating the parameters of the different ML algorithms in order to avoid overfitting.…”
Section: State of the Art
confidence: 99%
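The combination of synthetic balancing and parameter calibration described in this citation can be sketched, assuming a scikit-learn/imbalanced-learn stack, by placing SMOTE inside a cross-validated pipeline so that oversampling is applied only to the training folds; the classifier and grid values below are illustrative, not those reported by Gonzalez-Cuautle et al.

```python
# Sketch: SMOTE inside an imblearn pipeline, calibrated with GridSearchCV.
# Grid values and the SVC classifier are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=3000, n_features=20,
                           weights=[0.9, 0.1], random_state=1)

pipe = Pipeline([
    ("smote", SMOTE(random_state=1)),  # applied only when fitting each fold
    ("clf", SVC()),
])

param_grid = {
    "smote__k_neighbors": [3, 5],
    "clf__C": [0.1, 1, 10],
    "clf__gamma": ["scale", 0.01],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the sampler inside the pipeline means synthetic samples never leak into validation folds, which is one way to guard against the overfitting the citation mentions.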
“…Imbalanced data remains a key challenge for classification models [15,18]. The majority of the literature has considered re-sampling approaches, i.e., both over-sampling and under-sampling, to alleviate the degradation caused by imbalanced data [1,17,19,33,37]. Recent research contributions warn of the limitations and shortcomings that accompany re-sampling approaches [16,38,39].…”
Section: Theoretical Background
confidence: 99%
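For reference, the two re-sampling families named in this quotation can be contrasted with a short imbalanced-learn sketch on toy data (class counts and random seeds are arbitrary assumptions):

```python
# Contrast of the two re-sampling families: over-sampling the minority
# class versus under-sampling the majority class (toy data only).
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=7)

X_over, y_over = RandomOverSampler(random_state=7).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(random_state=7).fit_resample(X, y)

print("original:     ", Counter(y))
print("over-sampled: ", Counter(y_over))   # minority grown to majority size
print("under-sampled:", Counter(y_under))  # majority cut down to minority size
```

Over-sampling keeps all majority examples at the cost of extra (possibly redundant) training data, while under-sampling discards examples, which relates to the information loss the cited works warn about.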
“…The low volume of potential target/important customer data (i.e., an imbalanced data distribution) is a major challenge in extracting the latent knowledge in bank marketing data [1,3,10]. There is still a pressing need to handle imbalanced dataset distributions reliably [15][16][17]; commonly used approaches [1,15,16,[18][19][20][21] impose processing overhead or lead to loss of information.…”
Section: Introduction
confidence: 99%