Noisy Label Learning for Security Defects

Croft, Roland; Babar, M. Ali; Chen, Huaming

doi:10.48550/arxiv.2203.04468

Cited by 1 publication

(7 citation statements)

References 53 publications

“…In reality, complete knowledge of the latent vulnerabilities is unobtainable. Croft et al [13] observed at least twice as many latent vulnerabilities as known vulnerabilities in their dataset.…”

Section: Consistency (mentioning)

confidence: 93%

“…CodeBERT is a pre-trained state-of-the-art code embedding model based on the RoBERTa architecture [52]. Similar studies have demonstrated the effectiveness of CodeBERT for SVP [8], [13]. LineVul generates function-level predictions using a transformer-based architecture.…”

Section: Validating Attribute Impact (mentioning)

confidence: 99%

“…Hence, data quality issues will hinder the reliability and trustworthiness of the outcomes. For instance, previous studies [13], [33] have highlighted inflated performance due to inaccurate labelling mechanisms for nonvulnerable modules. Consequently, the industry value and adoption of SVP models is uncertain [37], [38].…”

Section: Background and Motivation (mentioning)

confidence: 99%

“…This largely relates to the semantic label correctness; i.e., whether or not data points labelled as vulnerable or non-vulnerable genuinely align. It has previously been observed that non-vulnerable labels are unreliable in real-world datasets as there is no ground truth label source for this class [10], [13], [33]. No oracle can reliably ensure the security and absence of exploits in a given code snippet.…”

Section: A Accuracy (mentioning)

confidence: 99%

“…Nonetheless, software vulnerability data collection is not a trivial task [10]. Labelled examples of software vulnerabilities are difficult to obtain in the real-world, as they are scarce [11], poorly documented [12], and limited to reported vulnerabilities [13]. Consequently, many researchers have conducted labourious work constructing large-scale software vulnerability datasets [14]-[17].…”

Section: Introduction (mentioning)

confidence: 99%

Data Quality for Software Vulnerability Datasets

Croft, Babar, Kholoosi

2023

Preprint

The use of learning-based techniques to achieve automated software vulnerability detection has been of longstanding interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the analyzed datasets exhibit some data quality problems. In particular, we found 20-71% of vulnerability labels to be inaccurate in real-world datasets, and 17-99% of data points were duplicated. We observed that these issues could cause significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.
