2020
DOI: 10.1016/j.cose.2020.102022
Are public intrusion datasets fit for purpose? Characterising the state of the art in intrusion event datasets

Abstract: In recent years, cybersecurity attacks have caused major disruption and information loss for online organisations, with high-profile incidents in the news. One of the key challenges in advancing the state of the art in intrusion detection is the lack of representative datasets. These datasets typically contain millions of time-ordered events (e.g. network packet traces, flow summaries, log entries), which are subsequently analysed to identify abnormal behaviour and specific attacks [1]. Generating realistic datasets has h…

Cited by 24 publications (9 citation statements)
References 46 publications (72 reference statements)
“…However, this data source lacks ground-truth information, so the clusters generated by GAC or our SOAAPR approach cannot be evaluated for their accuracy. Since we are interested in the complete processing pipeline, starting from the detection of outliers in X_t by the online OD algorithms, we focus on recent IDS datasets such as CICIDS2017 [61] and CSE-CIC-IDS2018 (license: https://registry.opendata.aws/cse-cic-ids2018/, accessed on 25 June 2021) [61], provided by the University of New Brunswick on AWS, or UNSW-NB15, since long-serving and still widely used datasets such as KDD Cup 99 (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, accessed on 25 June 2021) or NSL-KDD (https://www.unb.ca/cic/datasets/nsl.html, accessed on 25 June 2021) have been criticized by many researchers over the past couple of years [61,69]. Especially for the evaluation of anomaly-based IDS methods, the latest updated datasets, such as CSE-CIC-IDS2018, should be utilized [70].…”
Section: Data Source
confidence: 99%
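The point this excerpt makes about ground truth can be illustrated with a minimal sketch. The data below is hypothetical (not drawn from any dataset named above): a purity-style score of this kind is only computable when the dataset supplies true labels, which is exactly what the criticized sources lack.

```python
from collections import Counter

def cluster_purity(clusters, labels):
    """Fraction of points whose cluster's majority label matches their own.

    `clusters` and `labels` are parallel lists; `labels` is the ground
    truth that unlabelled traffic sources cannot provide.
    """
    assert len(clusters) == len(labels)
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    # Sum, per cluster, the size of the largest same-label group.
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in by_cluster.values()
    )
    return majority_total / len(labels)

# Hypothetical output of an online outlier-detection pipeline:
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["benign", "benign", "dos", "dos", "dos", "benign", "scan", "scan"]
print(cluster_purity(clusters, labels))  # 0.75
```

Without the `labels` column, any such accuracy measure is undefined, which is why the quoted work restricts itself to datasets that ship ground-truth annotations.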
“…Rather than relying on a single security-domain-specific dataset such as KDD'99, NSL-KDD, or the more recent CSE-CIC-IDS2018 (https://registry.opendata.aws/cse-cic-ids2018/, accessed on 5 March 2021), we deliberately chose real-world candidate datasets from the ODDS (Outlier Detection DataSets) Library (http://odds.cs.stonybrook.edu/about-odds/, accessed on 5 March 2021) [65], which are commonly used to evaluate OD methods, for various reasons. In recent years, the majority of state-of-the-art IDS datasets have been criticized by many researchers because their data is out of date or does not represent today's threat landscape [51,66,67]. Even though CSE-CIC-IDS2018 overcomes some of these shortcomings, it was not optimal for the extensive number of measurements performed (Figure 5) due to its enormous number of data instances spread across multiple files.…”
Section: Data Source
confidence: 99%
“…When choosing a dataset to train or test a SNIDS, it is necessary to consider the representativeness and accuracy of the data events. Obtaining representative, accurate, useful and correctly labelled network traffic data is significantly challenging, and maintaining such datasets is often impractical [36]. Many organizations that have the ability to generate and publish useful data are very protective of such information, not least because publishing traffic data has the potential to expose sensitive information.…”
Section: Network Traffic Datasets
confidence: 99%
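One practical first check of the representativeness this excerpt calls for is the label distribution of a candidate dataset. The sketch below uses a hypothetical CSV excerpt in a CICIDS-style layout (per-flow rows with a `Label` column); the column name, attack names, and values are illustrative assumptions, not taken from any real file.

```python
import csv
import io
from collections import Counter

# Hypothetical flow records; real dataset files contain millions of rows
# spread across many CSVs.
sample = """dst_port,flow_duration,Label
80,1200,Benign
443,560,Benign
22,90,SSH-Bruteforce
80,3100,Benign
22,75,SSH-Bruteforce
"""

def label_distribution(fp):
    """Return each label's share of the rows read from a CSV file object."""
    counts = Counter(row["Label"] for row in csv.DictReader(fp))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

print(label_distribution(io.StringIO(sample)))
# e.g. {'Benign': 0.6, 'SSH-Bruteforce': 0.4}
```

A heavily skewed distribution (or one missing entire attack classes) is a quick signal that the dataset may not represent the traffic mix a SNIDS will face in deployment.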
“…The datasets mentioned in Table 1 show considerable diversity in terms of the number of captured attacks. As stated by [36], many of the publicly available labelled datasets for research are static: they represent network behaviour only for a particular time period.…”
Section: Name
confidence: 99%