“…Croft et al in their work compile the processes of data preparation for vulnerability detection [18]. In fact, they do refer to the usage of base units and how that constrains vulnerability detection approaches, though they do not refer to a base unit by that name.…”
Deep learning (DL) has been a common thread across several recent techniques for vulnerability detection. The rise of large, publicly available datasets of vulnerabilities has fueled the learning process underpinning these techniques. While these datasets help the DL-based vulnerability detectors, they also constrain these detectors' predictive abilities. Vulnerabilities in these datasets have to be represented in a certain way, e.g., code lines, functions, or program slices within which the vulnerabilities exist. We refer to this representation as a base unit. The detectors learn how base units can be vulnerable and then predict whether other base units are vulnerable. We hypothesize that this focus on individual base units harms the ability of the detectors to properly detect vulnerabilities that span multiple base units (MBU vulnerabilities). For such vulnerabilities, a correct detection occurs only when all comprising base units are detected as vulnerable. Verifying how existing techniques perform in detecting all parts of a vulnerability is important to establish their effectiveness for other downstream tasks. To evaluate our hypothesis, we conducted a study focusing on three prominent DL-based detectors: ReVeal, DeepWukong, and LineVul. Our study shows that the datasets of all three detectors contain MBU vulnerabilities. Further, we observed significant accuracy drops when detecting these types of vulnerabilities. We present our study and a framework that can be used to help DL-based detectors properly account for MBU vulnerabilities.
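The evaluation criterion described above can be sketched in a few lines. This is a minimal illustrative example, not code from the paper; the function and unit names are hypothetical. Under the MBU criterion, a vulnerability spanning several base units counts as detected only when every one of its base units is predicted vulnerable:

```python
# Hypothetical sketch of the MBU detection criterion (names are illustrative).

def mbu_detected(vuln_units: set, predicted_vulnerable: set) -> bool:
    """vuln_units: base-unit ids comprising one MBU vulnerability.
    predicted_vulnerable: base-unit ids the detector flagged as vulnerable.
    A correct detection requires ALL comprising units to be flagged."""
    return vuln_units <= predicted_vulnerable  # subset check: every unit flagged

# A vulnerability spanning two functions, f1 and f3:
vuln = {"f1", "f3"}
print(mbu_detected(vuln, {"f1", "f2", "f3"}))  # True: both units flagged
print(mbu_detected(vuln, {"f1"}))              # False: partial detection
```

This stricter, all-or-nothing criterion is what distinguishes MBU evaluation from the usual per-unit accuracy, where flagging only `f1` above would still count as a hit.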
“…However, existing vulnerability datasets have been reported to exhibit varying degrees of quality issues such as noisy labels and duplication. To reduce the likelihood of experiment biases, following Croft et al's [19] standard practice, we employ two experienced security experts to manually confirm the correctness of vulnerability labels, and leverage a code clone detector to remove duplicate samples. Threats to External Validity refer to the generalizability of our approach.…”
Recently, Graph Neural Network (GNN)-based vulnerability detection systems have achieved remarkable success. However, the lack of explainability poses a critical challenge to deploying black-box models in security-related domains. For this reason, several approaches have been proposed to explain the decision logic of the detection model by providing a set of crucial statements positively contributing to its predictions. Unfortunately, due to weakly robust detection models and suboptimal explanation strategies, they risk revealing spurious correlations and suffer from redundancy issues. In this paper, we propose Coca, a general framework aiming to 1) enhance the robustness of existing GNN-based vulnerability detection models to avoid spurious explanations; and 2) provide both concise and effective explanations to reason about the detected vulnerabilities. Coca consists of two core parts, referred to as Trainer and Explainer. The former aims to train a detection model that is robust to random perturbation based on combinatorial contrastive learning, while the latter builds an explainer to derive, via dual-view causal inference, the crucial code statements that are most decisive to the detected vulnerability as explanations.
“…The authors analyzed 180 studies; their conclusions are as follows: • software vulnerability research has two main areas: predicting vulnerable software components and predicting new software vulnerabilities; • most vulnerability studies build their own datasets by collecting information from vulnerability databases containing data on real-world software; • there is growing interest in deep learning models and a shift toward textual representations of source code. [13] presents a literature review on data preparation for software vulnerability prediction. The authors reviewed 61 studies and developed a taxonomy of data preparation for this task.…”
The purpose of the study: to investigate the effectiveness of BERT modifications in solving the problem of predicting vulnerability categories (CVE) for information system devices based on their configurations (CPE URIs). Research methods: natural language processing methods, cross-validation of artificial intelligence models, and optimization of the hyperparameters of artificial intelligence models. 1 Dmitry Sergeevich Levshun, Candidate of Technical Sciences, PhD in Computer Science, Senior Researcher at the Laboratory of Computer Security Problems, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPb FRC RAS), …