Botnets are one of the online threats with the most significant presence, causing billionaire losses to global economies. Nowadays, the increasing number of devices connected to the Internet makes it necessary to analyze extensive network traffic data. In this work, we focus on increasing the performance of botnet traffic classification by selecting those features that further increase the detection rate. For this purpose, we use two feature selection techniques, i.e., Information Gain and Gini Importance, which led to three pre-selected subsets of five, six and seven features. Then, we evaluate the three feature subsets and three models, i.e., Decision Tree, Random Forest and k-Nearest Neighbors. To test the performance of the three feature vectors and the three models, we generate two datasets based on the CTU-13 dataset, namely QB-CTU13 and EQB-CTU13. Finally, we measure the performance as the macro averaged F1 score over the computational time required to classify a sample. The results show that the highest performance is achieved by Decision Trees using a five feature set, which obtained a mean F1 score of 85% classifying each sample in an average time of 0.78 microseconds.
Spammers take advantage of email popularity to send indiscriminately unsolicited emails. Although researchers and organizations continuously develop anti-spam filters based on binary classification, spammers bypass them through new strategies, like word obfuscation or image-based spam. For the first time in literature, we propose to classify spam email in categories to improve the handle of already detected spam emails, instead of just using a binary model. First, we applied a hierarchical clustering algorithm to create SPEMC-11K (SPam EMail Classification), the first multi-class dataset, which contains three types of spam emails: Health and Technology, Personal Scams, and Sexual Content. Then, we used SPEMC-11K to evaluate the combination of TF-IDF and BOW encodings with Naïve Bayes, Logistic Regression and SVM classifiers. Finally, we recommend for the task of multi-class spam classification the use of (i) TF-IDF combined with SVM for the best micro F1 score performance, 95.39%, and (ii) TD-IDF along with NB for the fastest spam classification, analyzing an email in 2.13ms.
Botnets are one of the most harmful cyberthreats, that can perform many types of cyberattacks and cause billionaire losses to the global economy. Nowadays, vast amounts of network traffic are generated every second, hence manual analysis is impossible. To be effective, automatic botnet detection should be done as fast as possible, but carrying this out is difficult in large bandwidths. To handle this problem, we propose an approach that is capable of carrying out an ultra-fast network analysis (i.e. on windows of one second), without a significant loss in the F1-score. We compared our model with other three literature proposals, and achieved the best performance: an F1 score of 0.926 with a processing time of 0.007 ms per sample. We also assessed the robustness of our model on saturated networks and on large bandwidths. In particular, our model is capable of working on networks with a saturation of 10% of packet loss, and we estimated the number of CPU cores needed to analyze traffic on three bandwidth sizes. Our results suggest that using commercial-grade cores of 2.4 GHz, our approach would only need four cores for bandwidths of 100 Mbps and 1 Gbps, and 19 cores on 10 Gbps networks.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.