D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis

Zheng, Yunhui; Pujar, Saurabh; Lewis, Burn L.; Buratti, Luca; Epstein, Edward S.; Yang, Bo; Laredo, Jim; Morari, Alessandro; Su, Zhongqing

doi:10.1109/icse-seip52600.2021.00020

Cited by 80 publications

(31 citation statements)

References 28 publications

“…However, this may not always be feasible as models may need to be constructed straight away for continuous bug prediction [64]. Zheng et al [71] recently proposed the D2A dataset that used static analysis and manual verification for additional labeling indicators. Whilst these efforts can help to uncover latent vulnerabilities in a more timely way, it is still limited to known vulnerabilities.…”

Section: Motivation (mentioning)

confidence: 99%

Noisy Label Learning for Security Defects

Croft, Babar, Chen. 2022. Preprint.

Data-driven software engineering processes, such as vulnerability prediction, rely heavily on the quality of the data used. In this paper, we observe that it is infeasible to obtain a noise-free security defect dataset in practice. Unlike the vulnerable class, non-vulnerable modules are difficult to verify as truly exploit-free given the limited manual effort available. This results in uncertainty, introduces labeling noise in the datasets, and affects conclusion validity. To address this issue, we propose novel learning methods that are robust to label impurities and can make the most of limited label data: noisy label learning. We investigate various noisy label learning methods applied to software vulnerability prediction. Specifically, we propose a two-stage learning method based on noise cleaning to identify and remediate the noisy samples, which improves AUC and recall of baselines by up to 8.9% and 23.4%, respectively. Moreover, we discuss several hurdles in terms of achieving a performance upper bound with semi-omniscient knowledge of the label noise. Overall, the experimental results show that learning from noisy labels can be effective for data-driven software and security analytics.

“…RE-VEAL [25] leverages the SMOTE re-sampling and duplicate removal method to address the problem of imbalanced datasets. D2A [26] proposes a curated benchmark dataset based on a differential analysis approach, by analysing version pairs of source code from multiple open-source projects.…”

Section: A Vulnerability Detection Using Deep Learning (mentioning)

confidence: 99%

“…6) D2A: The D2A dataset is a real-world vulnerability detection dataset curated and introduced by the IBM Research team [26]. This dataset consists of several open-source software projects like FFmpeg, httpd, Libav, LibTIFF, Nginx and OpenSSL.…”

Section: Muvuldeepecker (Mvd) (mentioning)

confidence: 99%

VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection

Hanif, Maffeis. 2022. Preprint.

This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity, and limited cost in terms of size of training data and number of model parameters.

“…A wide variety of datasets for source code exist, with many targeting one or a small number of tasks. Such tasks include clone detection, vulnerability detection [7,8], cloze test [9], code completion [10,11], code repair [12], code-to-code translation, natural language code search [13], text-to-code generation [14], and code summarization [13]. A detailed discussion of these tasks and their respective datasets is available in [15].…”

Section: Related Datasets (mentioning)

confidence: 99%

“…The sequence-of-tokens representation can be used with other neural networks of increasing capacity. We build a C-BERT model (a transformer model introduced in […]). C-BERT achieves appealing results on binary classification and vulnerability detection with C source code [7,32]. However, it has not been used on multiclass classification tasks or with other languages such as C++, Java, and Python.…”

Section: B3 C-BERT With Token Sequence (mentioning)

confidence: 99%

CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

Puri, Kung, Janssen, et al. 2021. Preprint. Self-citation.

Advancements in deep learning and machine learning algorithms have enabled breakthrough progress in computer vision, speech recognition, natural language processing and beyond. In addition, over the last several decades, software has been built into the fabric of every aspect of our society. Together, these two trends have generated new interest in the fast-emerging research area of "AI for Code". As software development becomes ubiquitous across all industries and code infrastructure of enterprise legacy applications ages, it is more critical than ever to increase software development productivity and modernize legacy applications. Over the last decade, datasets like ImageNet, with its large scale and diversity, have played a pivotal role in algorithmic advancements from computer vision to language and speech understanding. In this paper, we present "Project CodeNet", a first-of-its-kind, very large scale, diverse, and high-quality dataset to accelerate the algorithmic advancements in AI for Code. It consists of 14M code samples and about 500M lines of code in 55 different programming languages. Project CodeNet is not only unique in its scale, but also in the diversity of coding tasks it can help benchmark: from code similarity and classification for advances in code recommendation algorithms, and code translation between a large variety of programming languages, to advances in code performance (both runtime and memory) improvement techniques. CodeNet also provides sample input and output test sets for over 7M code samples, which can be critical for determining code equivalence in different languages. As a usability feature, we provide several preprocessing tools in Project CodeNet to transform source codes into representations that can be readily used as inputs into machine learning models.
