“…Due to the long-tailed distribution of CWE categories, we use three metrics, i.e., Macro F1, Weighted F1 and the multi-class version of the Matthews Correlation Coefficient (MCC) [23], for evaluation. These metrics are also used by other vulnerability-related studies [28,38]. Macro F1 is the unweighted mean of the F1-scores of all categories, whereas Weighted F1 is the mean weighted by the number of instances in each category.…”
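The difference between the two F1 variants matters precisely because of the long-tailed distribution the excerpt mentions: Macro F1 gives rare CWE categories the same influence as dominant ones, while Weighted F1 lets frequent categories dominate. A minimal sketch (the labels and example predictions are invented for illustration, not taken from the paper's dataset):

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1-score for a single class, computed from TP/FP/FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 averages per-class F1 equally; Weighted F1 weights by support."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    f1s = {c: per_class_f1(y_true, y_pred, c) for c in labels}
    macro = sum(f1s.values()) / len(labels)
    weighted = sum(f1s[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted

# Long-tailed toy example: "CWE-787" dominates, "CWE-416" is rare.
y_true = ["CWE-787"] * 8 + ["CWE-416"] * 2
y_pred = ["CWE-787"] * 8 + ["CWE-787", "CWE-416"]  # one rare instance missed
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
# Macro F1 drops noticeably (rare-class errors count fully);
# Weighted F1 stays high because the dominant class is almost perfect.
```

On this toy split, the single error on the rare class pulls Macro F1 down to about 0.80 while Weighted F1 remains near 0.89, which is exactly why long-tailed studies report both.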
Section: Methods (mentioning)
confidence: 99%
“…It is crucial to detect, categorize and assess vulnerabilities. Due to the rapid increase in the number of software vulnerabilities and the success of deep learning techniques, researchers have proposed diverse deep-learning-based approaches to automate vulnerability analysis, such as vulnerability detection [14,68], classification [8,70], patch identification [66,69] and assessment [37,38], and achieved promising results.…”
Section: Introduction (mentioning)
confidence: 99%
“…Vulnerability assessment is a process that determines various characteristics of vulnerabilities and helps practitioners prioritize the remediation of critical vulnerabilities [37,38]. CVSS is a commonly used expert-based vulnerability assessment framework.…”
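The excerpt names CVSS as the expert-based assessment framework that such prioritization relies on. As a concrete anchor, the CVSS v3.x specification maps base scores to qualitative severity ratings (None, Low, Medium, High, Critical), which is the output many assessment approaches predict. A small helper encoding that published mapping (the function name is ours, not from the paper):

```python
def cvss_v3_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating,
    following the score ranges published in the CVSS v3.x specification."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS base scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score <= 3.9:
        return "Low"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "High"
    return "Critical"  # 9.0 - 10.0
```

For example, a base score of 9.8 rates as Critical, which is the kind of vulnerability the excerpt suggests practitioners should remediate first.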
Vulnerability analysis is crucial for software security. Inspired by the success of pre-trained models on software engineering tasks, this work focuses on using pre-training techniques to enhance the understanding of vulnerable code and boost vulnerability analysis. The code understanding ability of a pre-trained model is highly related to its pre-training objectives. The semantic structure, e.g., control and data dependencies, of code is important for vulnerability analysis. However, existing pre-training objectives either ignore such structure or focus on learning to use it. The feasibility and benefits of learning the knowledge of analyzing semantic structure have not been investigated. To this end, this work proposes two novel pre-training objectives, namely Control Dependency Prediction (CDP) and Data Dependency Prediction (DDP), which aim to predict the statement-level control dependencies and token-level data dependencies, respectively, in a code snippet based only on its source code. During pre-training, CDP and DDP can guide the model to learn the knowledge required for analyzing fine-grained dependencies in code. After pre-training, the pre-trained model can boost the understanding of vulnerable code during fine-tuning and can directly be used to perform dependence analysis for both partial and complete functions. To demonstrate the benefits of our pre-training objectives, we pre-train a Transformer model named PDBERT with CDP and DDP, fine-tune it on three vulnerability analysis tasks, i.e., vulnerability detection, vulnerability classification, and vulnerability assessment, and also evaluate it on program dependence analysis.
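To make the DDP objective concrete: the model is trained to predict dependency pairs of this kind directly from raw source text. The sketch below is a deliberately simplified, statement-level def-use labeler (PDBERT's actual labels come from proper program analysis and are token-level; the regex-based extractor and its names are ours), showing the shape of the supervision signal:

```python
import re

def data_dependencies(statements):
    """Rough def-use sketch: statement i depends on statement j if i reads
    a variable that j was the most recent statement to assign.
    Returns a set of (user_index, definer_index) pairs."""
    last_def = {}   # variable name -> index of statement that last assigned it
    deps = set()
    for i, stmt in enumerate(statements):
        m = re.match(r"\s*(\w+)\s*=[^=]", stmt)   # simple "x = ..." assignment
        target = m.group(1) if m else None
        rhs = stmt.split("=", 1)[1] if m else stmt
        for var in re.findall(r"\b[a-zA-Z_]\w*\b", rhs):
            if var in last_def:
                deps.add((i, last_def[var]))
        if target:
            last_def[target] = i   # record the new definition after uses
    return deps

code = [
    "a = read_input()",  # 0: defines a
    "b = a + 1",         # 1: reads a  -> depends on 0
    "c = b * a",         # 2: reads b, a -> depends on 1 and 0
]
deps = data_dependencies(code)
# deps == {(1, 0), (2, 1), (2, 0)}
```

Pairs like these are exactly what DDP asks the model to recover without access to a parser, which is why a model that succeeds at the objective must internalize dependence analysis.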
“…For this example, lines 2-4, line 8 and line 10 are unrelated to the content and intent of this code change. As discussed in Section I, existing code change representation approaches either ignore the context [3], [8], [16], do not highlight the changed code [2], [13], [18], or consider all the context without adaptive information selection [14], [17]. These limitations hinder their effectiveness and generality, and motivate us to propose the query-back mechanism to explicitly highlight the changed code and learn to adaptively capture information from the code change.…”
Section: Motivation Of Query-back Mechanism (mentioning)
confidence: 99%
“…However, many of them adopt task-specific architectures and are trained from scratch, which makes it non-trivial to adapt them to other tasks, especially tasks with only small datasets. In addition, existing learning-based techniques either only focus on the changed code [3], [8], [16], separately encode the changed code and its context [14], [17], or encode the code change as a whole [2], [13], [18]. Some of them ignore the context or do not highlight the changed code.…”
Representing code changes as numeric feature vectors, i.e., code change representations, is usually an essential step to automate many software engineering tasks related to code changes, e.g., commit message generation and just-in-time defect prediction. Intuitively, the quality of code change representations is crucial for the effectiveness of automated approaches. Prior work on code changes usually designs and evaluates code change representation approaches for a specific task, and little work has investigated code change encoders that can be used and jointly trained on various tasks. To fill this gap, this work proposes a novel Code Change Representation learning approach named CCRep, which can learn to encode code changes as feature vectors for diverse downstream tasks. Specifically, CCRep regards a code change as the combination of its before-change and after-change code, leverages a pre-trained code model to obtain high-quality contextual embeddings of code, and uses a novel mechanism named query-back to extract and encode the changed code fragments and make them explicitly interact with the whole code change. To evaluate CCRep and demonstrate its applicability to diverse code-change-related tasks, we apply it to three tasks: commit message generation, patch correctness assessment, and just-in-time defect prediction. Experimental results show that CCRep outperforms the state-of-the-art techniques on each task.
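The first step the query-back mechanism depends on is identifying the changed fragments within the before/after pair, so that their embeddings can act as queries attending over the whole code change. A minimal line-level sketch of that extraction step using Python's `difflib` (the helper and its name are ours, not CCRep's actual implementation, which operates on model embeddings rather than raw lines):

```python
import difflib

def changed_fragments(before, after):
    """Return the removed and added lines of a code change.
    In a query-back-style encoder, the representations of these changed
    fragments would serve as attention queries over the full before/after
    code, letting the model adaptively select relevant context."""
    sm = difflib.SequenceMatcher(a=before, b=after)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(before[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(after[j1:j2])
    return removed, added

before = ["int f(int x) {", "  return x;", "}"]
after  = ["int f(int x) {", "  return x + 1;", "}"]
removed, added = changed_fragments(before, after)
# removed == ["  return x;"], added == ["  return x + 1;"]
```

Separating the changed fragments from their unchanged surroundings is what allows the mechanism to both highlight the change and still draw on context, rather than committing to one or the other as the cited prior approaches do.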