Data Feature Selection Methods on Distributed Big Data Processing Platforms

Catalkaya, Mehmet Burak; Kalipsiz, Oya; Aktaş, Mehmet S.; Turgut, Umut Orcun

doi:10.1109/ubmk.2018.8566451

Cited by 12 publications

(5 citation statements)

References 0 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…This is done so that all features considered have a similar dynamic range rather than one feature dominating due to its large dynamic range [17]. The second step is applying Synthetic Minority Oversampling TEchnique (SMOTE) to tackle the class imbalance problem often 2) Feature selection: The goal at this stage is to reduce the number of features inputted to the ML model to help reduce its computational complexity while still maintaining or even improving its detection performance [31]. To achieve this, information gain method is used to select the relevant features by ranking them according to the amount of information (in bits) they provide about the class [32].…”

Section: A Proposed Approach Descriptionmentioning

confidence: 99%

Optimized Random Forest Model for Botnet Detection Based on DNS Queries

Moubayed

Injadat

Shami

2020

2020 32nd International Conference on Microelectronics (ICM)

View full text Add to dashboard Cite

The Domain Name System (DNS) protocol plays a major role in today's Internet as it translates between website names and corresponding IP addresses. However, due to the lack of processes for data integrity and origin authentication, the DNS protocol has several security vulnerabilities. This often leads to a variety of cyber-attacks, including botnet network attacks. One promising solution to detect DNS-based botnet attacks is adopting machine learning (ML) based solutions. To that end, this paper proposes a novel optimized ML-based framework to detect botnets based on their corresponding DNS queries. More specifically, the framework consists of using information gain as a feature selection method and genetic algorithm (GA) as a hyperparameter optimization model to tune the parameters of a random forest (RF) classifier. The proposed framework is evaluated using a state-of-the-art TI-2016 DNS dataset. Experimental results show that the proposed optimized framework reduced the feature set size by up to 60%. Moreover, it achieved a high detection accuracy, precision, recall, and F-score compared to the default classifier. This highlights the effectiveness and robustness of the proposed framework in detecting botnet attacks.

show abstract

Section: A Proposed Approach Descriptionmentioning

confidence: 99%

Optimized Random Forest Model for Botnet Detection Based on DNS Queries

Moubayed

Injadat

Shami

2020

2020 32nd International Conference on Microelectronics (ICM)

View full text Add to dashboard Cite

show abstract

“…The second phase, a sub-set of features are selected using different feature selection mechanisms to be given to the classification model as input. This is done in an attempt to reduce the complexity of the classification model and decrease its training time without sacrificing its performance [30]. This is particularly important when dealing with large scale systems generating big data [30].…”

Section: B Proposed Approach Applicationmentioning

confidence: 99%

“…This is done in an attempt to reduce the complexity of the classification model and decrease its training time without sacrificing its performance [30]. This is particularly important when dealing with large scale systems generating big data [30]. Three different feature selection mechanisms are considered in this work representing three different categories of feature selection algorithms.…”

Section: B Proposed Approach Applicationmentioning

confidence: 99%

Ensemble-based Feature Selection and Classification Model for DNS Typo-squatting Detection

Moubayed¹,

Aqeeli²,

Shami³

2020

Preprint

View full text Add to dashboard Cite

Domain Name System (DNS) plays in important role in the current IP-based Internet architecture. This is because it performs the domain name to IP resolution. However, the DNS protocol has several security vulnerabilities due to the lack of data integrity and origin authentication within it. This paper focuses on one particular security vulnerability, namely typo-squatting. Typo-squatting refers to the registration of a domain name that is extremely similar to that of an existing popular brand with the goal of redirecting users to malicious/suspicious websites. The danger of typo-squatting is that it can lead to information threat, corporate secret leakage, and can facilitate fraud. This paper builds on our previous work in [1], which only proposed majorityvoting based classifier, by proposing an ensemble-based feature selection and bagging classification model to detect DNS typosquatting attack. Experimental results show that the proposed framework achieves high accuracy and precision in identifying the malicious/suspicious typo-squatting domains (a loss of at most 1.5% in accuracy and 5% in precision when compared to the model that used the complete feature set) while having a lower computational complexity due to the smaller feature set (a reduction of more than 50% in feature set size).

show abstract

“…This work compares between two different feature selection techniques, namely information gain-based and correlationbased feature selection, and explores their effect on the models' detection performance and time complexity. This is particularly relevant when designing ML models for large scale systems that generate high dimensional data [38].…”

Section: B Feature Selectionmentioning

confidence: 99%

“…The second stage of the proposed framework is conducting a feature selection process to reduce the number of features needed for the ML classification model. This is done to reduce the time complexity of the classification model and consequently decrease its training time without sacrificing its performance [38]. With that in mind, two different methods are compared within this stage of the framework.…”

Section: ) Bayesian Optimizationmentioning

confidence: 99%

Multi-Stage Optimized Machine Learning Framework for Network Intrusion Detection

Injadat,

Moubayed,

Nassif

et al. 2020

Preprint

View full text Add to dashboard Cite

Cyber-security garnered significant attention due to the increased dependency of individuals and organizations on the Internet and their concern about the security and privacy of their online activities. Several previous machine learning (ML)-based network intrusion detection systems (NIDSs) have been developed to protect against malicious online behavior. This paper proposes a novel multi-stage optimized ML-based NIDS framework that reduces computational complexity while maintaining its detection performance. This work studies the impact of oversampling techniques on the models' training sample size and determines the minimal suitable training sample size. Furthermore, it compares between two feature selection techniques, information gain and correlation-based, and explores their effect on detection performance and time complexity. Moreover, different ML hyperparameter optimization techniques are investigated to enhance the NIDS's performance. The performance of the proposed framework is evaluated using two recent intrusion detection datasets, the CICIDS 2017 and the UNSW-NB 2015 datasets. Experimental results show that the proposed model significantly reduces the required training sample size (up to 74%) and feature set size (up to 50%). Moreover, the model performance is enhanced with hyper-parameter optimization with detection accuracies over 99% for both datasets, outperforming recent literature works by 1-2% higher accuracy and 1-2% lower false alarm rate.

show abstract

Data Feature Selection Methods on Distributed Big Data Processing Platforms

Cited by 12 publications

References 0 publications

Optimized Random Forest Model for Botnet Detection Based on DNS Queries

Optimized Random Forest Model for Botnet Detection Based on DNS Queries

Ensemble-based Feature Selection and Classification Model for DNS Typo-squatting Detection

Multi-Stage Optimized Machine Learning Framework for Network Intrusion Detection

Contact Info

Product

Resources

About