2020
DOI: 10.3390/app10030794
Synthetic Minority Oversampling Technique for Optimizing Classification Tasks in Botnet and Intrusion-Detection-System Datasets

Abstract: Security is presently a hot research topic because of its impact on the daily information infrastructure. Machine-learning solutions have been improving classical detection practices, but detection tasks must deal with irregular amounts of data, since the number of instances that represent one or several malicious samples can vary significantly. On highly unbalanced data, classification models regularly achieve high precision with respect to the majority class, while minority classes are treated as noise due to the lack of infor…
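As a concrete illustration of the oversampling idea described in the abstract, the following is a minimal sketch using the imbalanced-learn implementation of SMOTE on a toy unbalanced dataset; the dataset and parameter values are assumptions for demonstration, not the configuration used in the paper.

```python
# Minimal SMOTE sketch (imbalanced-learn); toy data and parameters are
# illustrative only, not the paper's actual setup.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy, highly unbalanced binary dataset (roughly 5% minority class).
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    weights=[0.95, 0.05],
    random_state=42,
)
print("before:", Counter(y))

# SMOTE interpolates new minority samples between existing minority
# neighbours instead of simply duplicating them.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("after: ", Counter(y_res))
```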

Cited by 58 publications (33 citation statements) | References 53 publications
“…The first group of eight characteristics was chosen following the methodology introduced by Sharafaldin et al. [41], where a Random Forest Regressor was used to obtain the behavioral metrics (Subflow_Fwd_Byts, Subflow_Bwd_Byts, TotLen_Fwd_Pkts, TotLen_Bwd_Pkts, Fwd_Pkt_Len_Mean, Bwd_Pkt_Len_Mean, Fwd_Pkts/s and Bwd_Pkts/s) described in Table 1. The second group of nine features was added based on the methodology presented by Gonzalez-Cuautle et al. [16], where the ISOT HTTP Botnet Dataset was used (Src_Port, Dst_Port, Flow_Duration, Flow_Byts/s, Flow_Pkts/s, Tot_Fwd_Pkts, Tot_Bwd_Pkts, Subflow_Bwd_Pkts and Subflow_Fwd_Pkts). Two features were selected using the ‘feature_importances’ criterion to improve the precision of the Random Forest algorithm: Fwd Pkt Len Max and Fwd Pkt Len Min.…”
Section: Methods
confidence: 99%
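The quoted methodology ranks flow features with the Random Forest ‘feature_importances’ criterion and keeps the two highest-ranked ones. A minimal scikit-learn sketch of that selection step is given below; the placeholder feature names and toy data are assumptions, not the CIC-style flow features listed above.

```python
# Sketch of feature selection via RandomForestClassifier.feature_importances_;
# placeholder feature names, toy data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Rank features by impurity-based importance and keep, e.g., the top two
# (analogous to adding Fwd Pkt Len Max / Min in the quoted methodology).
ranking = np.argsort(forest.feature_importances_)[::-1]
top2 = [feature_names[i] for i in ranking[:2]]
print("selected features:", top2)
```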
“…Gonzalez-Cuautle et al. [16] proposed the Synthetic Minority Oversampling Technique (SMOTE) to address the difficulty of performing botnet classification on highly unbalanced datasets. The method was intended to improve the classification process with synthetically generated balanced data while optimally calibrating the parameters of the different ML algorithms in order to avoid overfitting.…”
Section: State of the Art
confidence: 99%
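The combination of synthetic balancing and parameter calibration described in this citation can be sketched, assuming a scikit-learn/imbalanced-learn stack, by placing SMOTE inside a cross-validated pipeline so that oversampling is applied only to the training folds; the classifier and grid values below are illustrative, not those reported by Gonzalez-Cuautle et al.

```python
# Sketch: SMOTE inside an imblearn pipeline, calibrated with GridSearchCV.
# Grid values and the SVC classifier are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=3000, n_features=20,
                           weights=[0.9, 0.1], random_state=1)

pipe = Pipeline([
    ("smote", SMOTE(random_state=1)),  # applied only when fitting each fold
    ("clf", SVC()),
])

param_grid = {
    "smote__k_neighbors": [3, 5],
    "clf__C": [0.1, 1, 10],
    "clf__gamma": ["scale", 0.01],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the sampler inside the pipeline means synthetic samples never leak into validation folds, which is one way to guard against the overfitting the citation mentions.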
“…Imbalanced data remains a key challenge for classification models [15,18]. The majority of the literature has considered re-sampling approaches, i.e., both over-sampling and under-sampling, to alleviate the degradation caused by imbalanced data [1,17,19,33,37]. Recent research contributions warn of the limitations and shortcomings that accompany re-sampling approaches [16,38,39].…”
Section: Theoretical Background
confidence: 99%
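For reference, the two re-sampling families named in this quotation can be contrasted with a short imbalanced-learn sketch on toy data (class counts and random seeds are arbitrary assumptions):

```python
# Contrast of the two re-sampling families: over-sampling the minority
# class versus under-sampling the majority class (toy data only).
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=7)

X_over, y_over = RandomOverSampler(random_state=7).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(random_state=7).fit_resample(X, y)

print("original:     ", Counter(y))
print("over-sampled: ", Counter(y_over))   # minority grown to majority size
print("under-sampled:", Counter(y_under))  # majority cut down to minority size
```

Over-sampling keeps all majority examples at the cost of extra (possibly redundant) training data, while under-sampling discards examples, which relates to the information loss the cited works warn about.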
“…The low volume of potential target/important customer data (i.e., an imbalanced data distribution) is a major challenge in extracting the latent knowledge in bank marketing data [1,3,10]. There is still a pressing need to handle imbalanced dataset distributions reliably [15][16][17]; commonly used approaches [1,15,16,[18][19][20][21] impose processing overhead or lead to loss of information.…”
Section: Introduction
confidence: 99%