Deep learning is emerging as a promising technique for building predictive models to support code-related tasks like performance optimization and code vulnerability detection. One of the critical aspects of building a successful predictive model is having the right representation to characterize the model input for the given task. Existing approaches in this area typically treat the program as a flat sequence of tokens and fail to capitalize on the rich semantics of data- and control-flow information, for which graphs are a proven representation. We present Poem, a novel framework that automatically learns useful code representations from graph-based program structures. At the core of Poem is a graph neural network (GNN) specially designed to capture syntactic and semantic information from the program abstract syntax tree and the control and data flow graphs. As a departure from existing GNN-based code modeling techniques, our network simultaneously learns over multiple relations of a program graph. This capability enables the learning framework to distinguish and reason about diverse code relationships, be it a data flow, a control flow, or any other relationship that may be important for the downstream processing task. We apply Poem to four representative tasks that require a strong ability to reason about program structure: heterogeneous device mapping, parallel thread coarsening, loop vectorization and code vulnerability detection. We evaluate Poem on programs written in OpenCL, C, Java and Swift, and compare it against nine learning-based methods. Experimental results show that Poem consistently outperforms all competing methods across evaluation settings.
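To make the "learning over multiple relations" idea concrete: one common way to realize it is relation-typed message passing, where each edge type (AST, control flow, data flow) gets its own learned transformation, so different kinds of edges contribute differently to a node's updated embedding. The sketch below is illustrative only; the function names, shapes, and the toy graph are assumptions for exposition, not Poem's actual implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def multi_relation_gnn_layer(h, edges_by_relation, W_rel, W_self):
    """One propagation step where each relation (edge type) has its own
    weight matrix, so data-flow and control-flow neighbors are treated
    differently when updating a node's embedding."""
    n, d = h.shape
    out = h @ W_self  # self-connection keeps the node's own features
    for rel, edges in edges_by_relation.items():
        W = W_rel[rel]
        msg = np.zeros_like(out)
        counts = np.zeros((n, 1))
        for src, dst in edges:  # messages flow along typed edges
            msg[dst] += h[src] @ W
            counts[dst] += 1
        out += msg / np.maximum(counts, 1)  # mean-aggregate per relation
    return relu(out)

# Toy program graph: 3 statement nodes connected by three edge types.
rng = np.random.default_rng(0)
d = 4
h = rng.normal(size=(3, d))          # initial node embeddings
edges = {
    "ast":          [(0, 1), (0, 2)],
    "control_flow": [(1, 2)],
    "data_flow":    [(0, 2)],
}
W_rel = {r: rng.normal(size=(d, d)) for r in edges}  # one matrix per relation
W_self = rng.normal(size=(d, d))
h1 = multi_relation_gnn_layer(h, edges, W_rel, W_self)
print(h1.shape)  # one updated embedding per node
```

Because each relation has its own parameters, the model can learn, for instance, that a data-flow edge should propagate value information while a control-flow edge should propagate reachability information, rather than conflating the two as a single untyped adjacency would.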