2020
DOI: 10.1016/j.cose.2020.102022
Are public intrusion datasets fit for purpose? Characterising the state of the art in intrusion event datasets

Abstract: In recent years, cybersecurity attacks have caused major disruption and information loss for online organisations, with high-profile incidents in the news. One of the key challenges in advancing the state of the art in intrusion detection is the lack of representative datasets. These datasets typically contain millions of time-ordered events (e.g. network packet traces, flow summaries, log entries), which are subsequently analysed to identify abnormal behaviour and specific attacks [1]. Generating realistic datasets has h…

Cited by 24 publications (9 citation statements)
References 46 publications (72 reference statements)
“…However, this data source lacks ground-truth information, so the clusters generated by GAC or our SOAAPR approach cannot be evaluated for their accuracy. Since we are interested in the complete processing pipeline, starting from the detection of outliers in X_t by the online OD algorithms, we focus on recent IDS datasets such as CICIDS2017 [61] and CSE-CIC-IDS2018 (license: https://registry.opendata.aws/cse-cic-ids2018/, accessed on 25 June 2021) [61], provided by the University of New Brunswick on AWS, or UNSW-NB15, since long-serving and still widely used datasets such as KDD Cup 99 (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, accessed on 25 June 2021) or NSL-KDD (https://www.unb.ca/cic/datasets/nsl.html, accessed on 25 June 2021) have been criticized by many researchers over the past couple of years [61,69]. Especially for the evaluation of anomaly-based IDS methods, the latest updated datasets, such as CSE-CIC-IDS2018, should be utilized [70].…”
Section: Data Source
confidence: 99%
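The point this excerpt makes about ground truth can be illustrated with a minimal sketch. The data below is hypothetical (not drawn from any dataset named above): a purity-style score of this kind is only computable when the dataset supplies true labels, which is exactly what the criticized sources lack.

```python
from collections import Counter

def cluster_purity(clusters, labels):
    """Fraction of points whose cluster's majority label matches their own.

    `clusters` and `labels` are parallel lists; `labels` is the ground
    truth that unlabelled traffic sources cannot provide.
    """
    assert len(clusters) == len(labels)
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    # Sum, per cluster, the size of the largest same-label group.
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in by_cluster.values()
    )
    return majority_total / len(labels)

# Hypothetical output of an online outlier-detection pipeline:
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["benign", "benign", "dos", "dos", "dos", "benign", "scan", "scan"]
print(cluster_purity(clusters, labels))  # 0.75
```

Without the `labels` column, any such accuracy measure is undefined, which is why the quoted work restricts itself to datasets that ship ground-truth annotations.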
“…Rather than relying on a single security-domain-specific dataset such as KDD'99, NSL-KDD, or the more recent CSE-CIC-IDS2018 (https://registry.opendata.aws/cse-cic-ids2018/, accessed on 5 March 2021), we deliberately chose real-world candidate datasets from the ODDS (Outlier Detection DataSets) Library (http://odds.cs.stonybrook.edu/about-odds/, accessed on 5 March 2021) [65], which are commonly used to evaluate OD methods, for various reasons. In recent years, the majority of state-of-the-art IDS datasets have been criticized by many researchers because their data is out of date or does not represent today's threat landscape [51,66,67]. Even though CSE-CIC-IDS2018 overcomes some of these shortcomings, it was not optimal for the extensive number of measurements performed (Figure 5) due to its enormous number of data instances spread across multiple files.…”
Section: Data Source
confidence: 99%
“…When choosing a dataset to train or test a SNIDS, it is necessary to consider the representativeness and accuracy of the data events. Obtaining representative, accurate, useful and correctly labelled network traffic data is significantly challenging, and maintaining such datasets is often impractical [36]. Many organizations that have the ability to generate and publish useful data are very protective of such information, not least because publishing traffic data has the potential to expose sensitive information.…”
Section: Network Traffic Datasets
confidence: 99%
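One practical first check of the representativeness this excerpt calls for is the label distribution of a candidate dataset. The sketch below uses a hypothetical CSV excerpt in a CICIDS-style layout (per-flow rows with a `Label` column); the column name, attack names, and values are illustrative assumptions, not taken from any real file.

```python
import csv
import io
from collections import Counter

# Hypothetical flow records; real dataset files contain millions of rows
# spread across many CSVs.
sample = """dst_port,flow_duration,Label
80,1200,Benign
443,560,Benign
22,90,SSH-Bruteforce
80,3100,Benign
22,75,SSH-Bruteforce
"""

def label_distribution(fp):
    """Return each label's share of the rows read from a CSV file object."""
    counts = Counter(row["Label"] for row in csv.DictReader(fp))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

print(label_distribution(io.StringIO(sample)))
# e.g. {'Benign': 0.6, 'SSH-Bruteforce': 0.4}
```

A heavily skewed distribution (or one missing entire attack classes) is a quick signal that the dataset may not represent the traffic mix a SNIDS will face in deployment.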
“…The datasets mentioned in Table 1 show considerable diversity in terms of the number of captured attacks. As stated by [36], many of the publicly available labelled datasets for research are static: they represent network behaviour only for a particular time period.…”
Section: Name
confidence: 99%