Abstract: Software Vulnerability Prediction (SVP) is a data-driven technique for software quality assurance that has recently gained considerable attention in the Software Engineering research community. However, the difficulties of preparing Software Vulnerability (SV) related data remain the main barrier to industrial adoption. Despite this problem, there have been no systematic efforts to analyse the existing SV data preparation techniques and challenges. Without such insights, we are unable to overcome the challenges…
“…Mining software repositories has become a popular area of empirical software engineering research [36]. However, the use of these data sources does not come without perils; the data is not necessarily clean and can exhibit significant noise [37], [38]. Our study contributes to this body of knowledge by highlighting a unique data quality issue that is present in SV reporting data sources: SV severity ranking inconsistency.…”
Section: A. Vulnerability Report Inconsistencies
Software Vulnerability (SV) severity assessment is a vital task for informing SV remediation and triage. Rankings of SV severity scores are often used to guide the prioritization of patching efforts. However, severity assessment is a difficult and subjective manual task that relies on expertise, knowledge, and standardized reporting schemes. Consequently, different data sources that perform independent analysis may provide conflicting severity rankings. Inconsistency across these data sources affects the reliability of severity assessment data and can consequently impact SV prioritization and fixing. In this study, we investigate severity ranking inconsistencies over the SV reporting lifecycle. Our analysis helps characterize the nature of this problem, identify correlated factors, and determine the impacts of inconsistency on downstream tasks. We observe that SV severity often lacks consideration or is underestimated during initial reporting, and such SVs consequently receive lower prioritization. We identify six potential attributes that are correlated with this misjudgment, and show that inconsistency in severity reporting schemes can severely degrade the performance of downstream severity prediction by up to 77%. Our findings help raise awareness of SV severity data inconsistencies and draw attention to this data quality problem. These insights can help developers better select SV severity data sources and improve the reliability of consequent SV prioritization. Furthermore, we encourage researchers to pay more attention to SV severity data selection.
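The ranking comparison at the heart of this inconsistency analysis can be illustrated with a short sketch. The snippet below maps CVSS v3.1 base scores to their standard qualitative ratings and flags records where two sources disagree; the source names and scores are hypothetical placeholders, and the study's own ranking schemes may differ.

```python
# Sketch: flag severity-ranking inconsistencies between two SV data sources.
# The CVSS v3.1 qualitative scale is standard; the source names and example
# scores below are hypothetical and only illustrate the comparison.

def cvss_v3_rating(score: float) -> str:
    """Map a CVSS v3.1 base score to its qualitative severity rating."""
    if score == 0.0:
        return "None"
    if score <= 3.9:
        return "Low"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "High"
    return "Critical"

# Hypothetical severity scores for the same CVEs from two independent sources.
nvd_scores = {"CVE-2021-0001": 9.8, "CVE-2021-0002": 5.3}
vendor_scores = {"CVE-2021-0001": 7.5, "CVE-2021-0002": 5.0}

for cve in nvd_scores.keys() & vendor_scores.keys():
    a = cvss_v3_rating(nvd_scores[cve])
    b = cvss_v3_rating(vendor_scores[cve])
    if a != b:
        print(f"{cve}: severity ranking inconsistency ({a} vs {b})")
```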
“…Thus, to achieve a better and more practical solution, our study extends this knowledge by investigating the feasibility of a variety of NLL techniques. Furthermore, we are the first to analyse noise tolerance for security defect datasets, which have been suggested to exhibit even greater data quality issues than regular defect datasets [10].…”
Section: Noise Tolerant Approaches For Defect Prediction
“…We firstly opted to conduct our classification at the file-level; we assigned each source code file a label as to whether it contains a reported vulnerability or not. The majority of SVP research has been conducted at the file-level [10], but recent state-of-the-art models have moved towards finer granularities [39]. However, we choose to retain our prediction at the file-level, as further localisation of vulnerable code within files would introduce additional noise and distrust in our positive labels.…” (Footnote 2: https://cwe.mitre.org/top25/archive/2021/2021_cwe_top25.html)
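For readers unfamiliar with file-level labelling, a minimal sketch is given below, assuming vulnerability-fixing commits have already been traced to the files they modify. The helper name and example paths are hypothetical; a real pipeline would mine this mapping from vulnerability reports and the project's version control history.

```python
# Minimal sketch of file-level labelling, assuming vulnerability-fixing
# commits have already been mapped to the files they modify. The helper
# and data below are hypothetical placeholders.

from typing import Dict, Iterable, Set

def label_files(all_files: Iterable[str], vuln_fixed_files: Set[str]) -> Dict[str, int]:
    """Label each source file: 1 if a reported SV was fixed in it, else 0."""
    return {f: int(f in vuln_fixed_files) for f in all_files}

# Hypothetical project snapshot and files touched by vulnerability fixes.
files = ["auth/login.c", "net/parser.c", "ui/render.c"]
fixed = {"net/parser.c"}

labels = label_files(files, fixed)
print(labels)  # {'auth/login.c': 0, 'net/parser.c': 1, 'ui/render.c': 0}
```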
“…Firstly, class imbalance is a significant issue prevalent in SV datasets [10], which negatively influences the prediction capabilities of data-driven models. Although solutions to class imbalance exist, such as rebalancing and reweighting, both of which we have investigated and implemented, they are far from complete.…”
Section: Difficulties In Adoption
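As a concrete illustration of the reweighting mentioned in the preceding snippet, the sketch below uses scikit-learn's balanced class weights on a synthetic, heavily imbalanced dataset; the feature matrix and labels are placeholders rather than the authors' data.

```python
# Sketch of the reweighting mitigation for class imbalance, using scikit-learn.
# X and y are synthetic placeholders; any vectorised SV dataset with a
# minority vulnerable class would fit the same pattern.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = np.r_[np.ones(50, dtype=int), np.zeros(950, dtype=int)]  # ~5% vulnerable

# Reweighting: penalise mistakes on the rare vulnerable class more heavily.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
clf = LogisticRegression(class_weight={0: weights[0], 1: weights[1]}, max_iter=1000)
clf.fit(X, y)

# Rebalancing (e.g. over-sampling the minority class) is the other option;
# libraries such as imbalanced-learn provide ready-made samplers.
```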
“…Defect data preparation requires code modules to be labeled as clean or defective. To achieve this, researchers typically collect reported post-release software defects for identifying the faulty code modules [10]. The label correctness is inherently critical for training and evaluation of a prediction model [2], and mislabeled instances can heavily influence research outcomes [31,60].…”
Data-driven software engineering processes, such as vulnerability prediction, rely heavily on the quality of the data used. In this paper, we observe that it is infeasible to obtain a noise-free security defect dataset in practice. Unlike the vulnerable class, the non-vulnerable modules are difficult to verify and determine as truly exploit free given the limited manual effort available. This results in uncertainty, introduces labeling noise into the datasets, and affects conclusion validity. To address this issue, we propose novel learning methods that are robust to label impurities and can leverage the most from limited labeled data: noisy label learning. We investigate various noisy label learning methods applied to software vulnerability prediction. Specifically, we propose a two-stage learning method based on noise cleaning to identify and remediate the noisy samples, which improves the AUC and recall of baselines by up to 8.9% and 23.4%, respectively. Moreover, we discuss several hurdles to achieving a performance upper bound with semi-omniscient knowledge of the label noise. Overall, the experimental results show that learning from noisy labels can be effective for data-driven software and security analytics.
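To make the two-stage idea concrete, the following sketch approximates a noise-cleaning workflow with out-of-fold probabilities: samples whose observed label receives a very low predicted probability are flagged as likely mislabelled and dropped before retraining. This is an illustrative approximation under assumed thresholds and models, not the paper's exact method.

```python
# Generic two-stage noise-cleaning sketch (identify suspected label noise,
# then retrain on the retained samples). Illustrative only; the base model
# and threshold are assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def clean_and_retrain(X, y, threshold=0.8):
    """X: feature matrix (ndarray), y: integer labels 0/1 (ndarray)."""
    base = RandomForestClassifier(n_estimators=100, random_state=0)

    # Stage 1: out-of-fold class probabilities; a sample whose observed label
    # gets very low predicted probability is flagged as potentially mislabelled.
    proba = cross_val_predict(base, X, y, cv=5, method="predict_proba")
    suspect = proba[np.arange(len(y)), y] < (1.0 - threshold)

    # Stage 2: retrain on the presumed-clean samples only.
    keep = ~suspect
    cleaned = RandomForestClassifier(n_estimators=100, random_state=0)
    cleaned.fit(X[keep], y[keep])
    return cleaned, suspect
```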