2020
DOI: 10.1109/access.2020.3029100

The Effective Methods for Intrusion Detection With Limited Network Attack Data: Multi-Task Learning and Oversampling

Abstract: Recently, many anomaly intrusion detection algorithms have been developed and applied in network security. These algorithms achieve high detection rates on many classical datasets. However, most of them fail to address two challenges: 1) imbalanced traffic data with limited network attack samples, and 2) multiple data sources distributed across different terminals. In detail, those algorithms assume that there are sufficient network traffic data to train their models for intrusion detection. Due to the network attac…
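
The oversampling half of the paper's approach can be illustrated with a minimal sketch: the minority attack class is resampled with replacement until it matches the benign class. This is a hypothetical NumPy illustration of plain random oversampling, not the authors' exact procedure; the array names, feature dimensions, and class sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced traffic data: 1000 benign flows, 30 attack flows.
X_benign = rng.normal(size=(1000, 20))
X_attack = rng.normal(loc=1.5, size=(30, 20))
X = np.vstack([X_benign, X_attack])
y = np.concatenate([np.zeros(1000), np.ones(30)])

# Random oversampling: draw minority samples with replacement until the
# attack class is as large as the benign class.
minority_idx = np.where(y == 1)[0]
n_needed = int((y == 0).sum() - (y == 1).sum())
extra_idx = rng.choice(minority_idx, size=n_needed, replace=True)

X_balanced = np.vstack([X, X[extra_idx]])
y_balanced = np.concatenate([y, y[extra_idx]])

print("before:", np.bincount(y.astype(int)))           # [1000   30]
print("after: ", np.bincount(y_balanced.astype(int)))  # [1000 1000]
```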

Cited by 11 publications (4 citation statements)
References 33 publications
“…On the other hand, our solution yields a much higher recall rate than most of the previous works did. We also noted that several authors proposed to tackle the imbalanced data problem, such as [11], [21], [32]–[34]. Unfortunately, none of these methods supports learning on multiple data sources.…”
Section: B. Performance Comparison With the State-of-the-Art Work
confidence: 99%
“…5 shows a schematic of such a workflow. CEF-SsL begins by splitting D in F and L: the former, F, is used exclusively to assess the performance on future data; the latter, L, is used for all remaining 'training' operations, because L can serve as basis to generate L, and then treat the remaining samples as unlabelled, representing U.…”
Section: Stage One: Prepare
confidence: 99%
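
As a rough illustration of the splitting step described in this quotation, the sketch below partitions a dataset D into a held-out evaluation split F and a training pool L, then keeps a small labelled subset of L and treats the remaining samples as the unlabelled set U. The split ratios and variable names are assumptions for the example, not CEF-SsL's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset D: feature matrix and labels.
D_X = rng.normal(size=(5000, 10))
D_y = rng.integers(0, 2, size=5000)

# F: held out exclusively to assess performance on "future" data.
idx = rng.permutation(len(D_X))
n_future = int(0.2 * len(D_X))            # assumed 20% evaluation split
F_idx, L_idx = idx[:n_future], idx[n_future:]

# Within the training pool L, keep a small labelled subset and treat the
# remaining samples as unlabelled U (their labels are discarded for training).
n_labelled = int(0.05 * len(L_idx))        # assumed 5% labelled budget
labelled_idx = L_idx[:n_labelled]
U_idx = L_idx[n_labelled:]

print(f"F (evaluation): {len(F_idx)}  labelled: {len(labelled_idx)}  U (unlabelled): {len(U_idx)}")
print("labelled class counts:", np.bincount(D_y[labelled_idx]))
```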
“…Zhou and Zhao [37] proposed the Clustered Multitask Learning (CMTL) approach, which described an arbitrary task with multiple representative tasks to give an accurate representation. Sun et al. [38] combined MTL with oversampling for intrusion detection. Their method used MTL to learn relevant information from multiple tasks at the same time and then used the learned information for a single task.…”
Section: Model Algorithms
confidence: 99%
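
As a hedged sketch of the multi-task idea attributed to Sun et al. [38] in the statement above (a shared representation learned across tasks, later reused for a single task), the example below trains a shared encoder with two task-specific heads on synthetic data. The architecture, the two task definitions (binary attack detection and attack-category classification), and the equal loss weighting are illustrative assumptions, not the published model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class MultiTaskIDS(nn.Module):
    """Shared encoder with two task-specific heads (illustrative only)."""
    def __init__(self, in_dim=20, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_binary = nn.Linear(hidden, 2)    # task 1: attack vs. benign
        self.head_category = nn.Linear(hidden, 5)  # task 2: attack category

    def forward(self, x):
        z = self.encoder(x)                # representation shared by both tasks
        return self.head_binary(z), self.head_category(z)

model = MultiTaskIDS()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Synthetic batch standing in for network-flow features and the two label sets.
x = torch.randn(128, 20)
y_bin = torch.randint(0, 2, (128,))
y_cat = torch.randint(0, 5, (128,))

for _ in range(5):  # a few joint-training steps
    out_bin, out_cat = model(x)
    loss = loss_fn(out_bin, y_bin) + loss_fn(out_cat, y_cat)  # equal task weights (assumed)
    opt.zero_grad()
    loss.backward()
    opt.step()

print("joint loss after a few steps:", loss.item())
```

After joint training, the shared encoder (and one head) can be kept for the single downstream detection task, which is the reuse pattern the citation statement describes.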