2023
DOI: 10.48550/arxiv.2301.05456
Preprint

Data Quality for Software Vulnerability Datasets

Abstract: The use of learning-based techniques to achieve automated software vulnerability detection has been of longstanding interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is g…
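The abstract's central concern, the quality of vulnerability datasets, is easy to probe in practice. Below is a minimal Python sketch of one common quality check, uniqueness, which flags code samples that collapse to the same form after whitespace normalisation; the record format and the normalisation step are illustrative assumptions, not the paper's own procedure.

```python
# Hypothetical sketch: flagging duplicate samples in a vulnerability dataset.
# Dataset format ({"code": ..., "label": ...} records) is an assumption.
import hashlib
import re

def normalise(code: str) -> str:
    """Collapse whitespace so trivially reformatted clones hash identically."""
    return re.sub(r"\s+", " ", code).strip()

def find_duplicates(samples: list[dict]) -> dict[str, list[int]]:
    """Group sample indices by the hash of their normalised code."""
    groups: dict[str, list[int]] = {}
    for i, sample in enumerate(samples):
        digest = hashlib.sha256(normalise(sample["code"]).encode()).hexdigest()
        groups.setdefault(digest, []).append(i)
    return {h: idxs for h, idxs in groups.items() if len(idxs) > 1}

if __name__ == "__main__":
    data = [
        {"code": "int f(int x) { return x / 0; }", "label": 1},
        {"code": "int f(int x)\n{\n  return x / 0;\n}", "label": 1},  # reformatted clone
        {"code": "int g(int x) { return x + 1; }", "label": 0},
    ]
    for digest, idxs in find_duplicates(data).items():
        print(f"duplicate group {digest[:8]}: samples {idxs}")
```

Stronger clone detectors (token- or AST-based) would catch more duplicates; exact hashing after normalisation is just the cheapest baseline.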

Cited by 2 publications (2 citation statements) · References 55 publications

Citation statements:
“…We have also noticed this in our own experiments. Because of this, the quality of the data included in vulnerability datasets has been the subject of previous research [16]. Security data quality transcends vulnerability detection: security bug report prediction has also gained traction and data quality matters for this type of insight too [39,41].…”
Section: Related Work
confidence: 99%
“…Orthogonal to label errors, prior work has also observed non-trivial overlap between test and train splits in datasets on which natural language processing and computer vision models are evaluated (e.g., Finegan-Dollak et al., 2018; Allamanis, 2019; Barz and Denzler, 2020; Lewis et al., 2021; Wen et al., 2022; Croft et al., 2023). Such work often argues that non-trivial amounts of overlap between test and train data can lead to "inflated" performance scores, as overlapping data can reward a model's ability to memorize training data (Elangovan et al., 2021), and to under-estimate out-of-sample error (Søgaard et al., 2021).…”
Section: Analysis of Datasets
confidence: 99%
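To make the quoted concern concrete, here is a minimal Python sketch (an illustrative assumption, not any cited paper's procedure) that estimates the fraction of test samples whose whitespace-normalised code also appears in the training split, i.e. the kind of test-train overlap that can inflate scores by rewarding memorisation.

```python
# Hypothetical sketch: measuring exact test-train overlap after
# whitespace normalisation. Splits are plain lists of code strings.
import re

def normalise(code: str) -> str:
    """Collapse whitespace so trivially reformatted clones compare equal."""
    return re.sub(r"\s+", " ", code).strip()

def test_train_overlap(train: list[str], test: list[str]) -> float:
    """Fraction of test samples whose normalised form occurs in the train split."""
    train_set = {normalise(c) for c in train}
    leaked = sum(1 for c in test if normalise(c) in train_set)
    return leaked / len(test) if test else 0.0

if __name__ == "__main__":
    train = ["int f(int x) { return x / 0; }", "int g(int x) { return x + 1; }"]
    test = ["int f(int x)\n{\n  return x / 0;\n}", "int h(int x) { return x - 1; }"]
    print(f"overlap: {test_train_overlap(train, test):.0%}")  # 50%
```

A non-zero overlap rate is a signal to deduplicate before splitting, since any leaked sample lets the model score points for recall of the training data rather than generalisation.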