2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) 2021
DOI: 10.1109/icse-seip52600.2021.00020

D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis

Abstract: Static analysis tools are widely used for vulnerability detection as they understand programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug …

Cited by 80 publications (31 citation statements)
References 28 publications
“…However, this may not always be feasible as models may need to be constructed straight away for continuous bug prediction [64]. Zheng et al [71] recently proposed the D2A dataset that used static analysis and manual verification for additional labeling indicators. Whilst these efforts can help to uncover latent vulnerabilities in a more timely way, it is still limited to known vulnerabilities.…”
Section: Motivation (mentioning)
confidence: 99%
“…RE-VEAL [25] leverages the SMOTE re-sampling and duplicate removal method to address the problem of imbalanced datasets. D2A [26] proposes a curated benchmark dataset based on a differential analysis approach, by analysing version pairs of source code from multiple open-source projects.…”
Section: A Vulnerability Detection Using Deep Learning (mentioning)
confidence: 99%
“…6) D2A: The D2A dataset is a real-world vulnerability detection dataset curated and introduced by the IBM Research team [26]. This dataset consists of several open-source software projects like FFmpeg, httpd, Libav, LibTIFF, Nginx and OpenSSL.…”
Section: MuVulDeePecker (MVD) (mentioning)
confidence: 99%
“…A wide variety of datasets for source code exist, with many targeting one or a small number of tasks. Such tasks include clone detection, vulnerability detection [7,8], cloze test [9], code completion [10,11], code repair [12], code-to-code translation, natural language code search [13], text-to-code generation [14], and code summarization [13]. A detailed discussion of these tasks and their respective datasets is available in [15].…”
Section: Related Datasets (mentioning)
confidence: 99%
“…The sequence-of-tokens representation can be used with other neural networks of increasing capacity. We build a C-BERT model (a transformer model introduced in C-BERT achieves appealing results on binary classification and vulnerability detection with C source code [7,32]. However, it has not been used on multiclass classification tasks or with other languages such as C++, Java, and Python.…”
Section: B3 C-BERT With Token Sequence (mentioning)
confidence: 99%