2023
DOI: 10.48550/arxiv.2301.05456
Preprint

Data Quality for Software Vulnerability Datasets

Abstract: The use of learning-based techniques to achieve automated software vulnerability detection has been of longstanding interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is g…
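The abstract's central concern, the quality of vulnerability datasets, is easy to probe in practice. Below is a minimal Python sketch of one common quality check, uniqueness, which flags code samples that collapse to the same form after whitespace normalisation; the record format and the normalisation step are illustrative assumptions, not the paper's own procedure.

```python
# Hypothetical sketch: flagging duplicate samples in a vulnerability dataset.
# Dataset format ({"code": ..., "label": ...} records) is an assumption.
import hashlib
import re

def normalise(code: str) -> str:
    """Collapse whitespace so trivially reformatted clones hash identically."""
    return re.sub(r"\s+", " ", code).strip()

def find_duplicates(samples: list[dict]) -> dict[str, list[int]]:
    """Group sample indices by the hash of their normalised code."""
    groups: dict[str, list[int]] = {}
    for i, sample in enumerate(samples):
        digest = hashlib.sha256(normalise(sample["code"]).encode()).hexdigest()
        groups.setdefault(digest, []).append(i)
    return {h: idxs for h, idxs in groups.items() if len(idxs) > 1}

if __name__ == "__main__":
    data = [
        {"code": "int f(int x) { return x / 0; }", "label": 1},
        {"code": "int f(int x)\n{\n  return x / 0;\n}", "label": 1},  # reformatted clone
        {"code": "int g(int x) { return x + 1; }", "label": 0},
    ]
    for digest, idxs in find_duplicates(data).items():
        print(f"duplicate group {digest[:8]}: samples {idxs}")
```

Stronger clone detectors (token- or AST-based) would catch more duplicates; exact hashing after normalisation is just the cheapest baseline.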

Cited by 2 publications (2 citation statements) · References 55 publications

Citation statements:
“…We have also noticed this in our own experiments. Because of this, the quality of the data included in vulnerability datasets has been the subject of previous research [16]. Security data quality transcends vulnerability detection: security bug report prediction has also gained traction and data quality matters for this type of insight too [39,41].…”
Section: Related Work
confidence: 99%
“…Orthogonal to label errors, prior work has also observed non-trivial overlap between test and train splits in datasets on which natural language processing and computer vision models are evaluated (e.g., Finegan-Dollak et al., 2018; Allamanis, 2019; Barz and Denzler, 2020; Lewis et al., 2021; Wen et al., 2022; Croft et al., 2023). Such work often argues that non-trivial amounts of overlap between test and train data can lead to "inflated" performance scores, as overlapping data can reward a model's ability to memorize training data (Elangovan et al., 2021), and to under-estimate out-of-sample error (Søgaard et al., 2021).…”
Section: Analysis of Datasets
confidence: 99%
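To make the quoted concern concrete, here is a minimal Python sketch (an illustrative assumption, not any cited paper's procedure) that estimates the fraction of test samples whose whitespace-normalised code also appears in the training split, i.e. the kind of test-train overlap that can inflate scores by rewarding memorisation.

```python
# Hypothetical sketch: measuring exact test-train overlap after
# whitespace normalisation. Splits are plain lists of code strings.
import re

def normalise(code: str) -> str:
    """Collapse whitespace so trivially reformatted clones compare equal."""
    return re.sub(r"\s+", " ", code).strip()

def test_train_overlap(train: list[str], test: list[str]) -> float:
    """Fraction of test samples whose normalised form occurs in the train split."""
    train_set = {normalise(c) for c in train}
    leaked = sum(1 for c in test if normalise(c) in train_set)
    return leaked / len(test) if test else 0.0

if __name__ == "__main__":
    train = ["int f(int x) { return x / 0; }", "int g(int x) { return x + 1; }"]
    test = ["int f(int x)\n{\n  return x / 0;\n}", "int h(int x) { return x - 1; }"]
    print(f"overlap: {test_train_overlap(train, test):.0%}")  # 50%
```

A non-zero overlap rate is a signal to deduplicate before splitting, since any leaked sample lets the model score points for recall of the training data rather than generalisation.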