Shangqing Liu scite author profile

Commit messages record code changes (e.g., feature modifications and bug repairs) in natural language, and are useful for program comprehension. Due to the frequent updates of software and the time cost involved, developers are generally unmotivated to write commit messages for code changes. Automating the message writing process is therefore necessary. Previous studies on commit message generation have benefited from generation models or retrieval models, but the structure of the changed code, which can be important for capturing code semantics, has not been explicitly involved. Moreover, although generation models have the advantage of synthesizing commit messages for new code changes, they cannot easily bridge the semantic gap between code and natural language, a gap that retrieval models could mitigate. In this paper, we propose a novel commit message generation model, named ATOM, which explicitly incorporates the abstract syntax tree for representing code changes and integrates both retrieved and generated messages through hybrid ranking. Specifically, the hybrid ranking module can prioritize the most accurate message from both retrieved and generated messages regarding one code change. We evaluate the proposed model ATOM on our dataset crawled from 56 popular Java repositories. Experimental results demonstrate that ATOM improves on the state-of-the-art models by 30.72% in terms of BLEU-4 (an accuracy measure that is widely used to evaluate text generation systems). Qualitative analysis also demonstrates the effectiveness of ATOM in generating accurate code commit messages.

GraphSearchNet: Enhancing GNNs via Capturing Global Dependencies for Semantic Code Search

Liu

Xie

Siow

et al. 2023

IEEE Trans. Software Eng.

GraphSearchNet: Enhancing GNNs via Capturing Global Dependencies for Semantic Code Search

Liu

Xie

et al. 2021

Preprint

Code search aims to retrieve the relevant code fragments based on a natural language query to improve the software productivity and quality. However, automatic code search is challenging due to the semantic gap between the source code and the query. Most existing approaches mainly consider the sequential information for embedding, where the structure information behind the text is not fully considered. In this paper, we design a novel neural network framework, named GraphSearchNet, to enable an effective and accurate source code search by jointly learning rich semantics of both source code and queries. Specifically, we propose to encode both source code and queries into two graphs with Bidirectional GGNN to capture the local structure information of the graphs. Furthermore, we enhance BiGGNN by utilizing the effective multi-head attention to supplement the global dependency that BiGGNN missed. The extensive experiments on both Java and Python datasets illustrate that GraphSearchNet outperforms current state-of-the-art works by a significant margin.

Do different cross‐project defect prediction methods identify the same defective modules?

Chen

Qu

et al. 2019

J Software Evolu Process

Cross‐project defect prediction (CPDP) is needed when the target projects are new or have little training data, since these projects do not have sufficient historical data to build high‐quality prediction models. Researchers have proposed many CPDP methods, and previous studies have conducted extensive comparisons of the performance of different CPDP methods. However, to the best of our knowledge, it remains unclear whether different CPDP methods can identify the same defective modules, and this issue has not been thoroughly explored. In this article, we select 12 state‐of‐the‐art CPDP methods, including eight supervised methods and four unsupervised methods. We first compare the performance of these methods in the same experiment settings on five widely used datasets (ie, NASA, SOFTLAB, PROMISE, AEEEM, and ReLink) and rank these methods via the Scott‐Knott test. Final results confirm the competitiveness of unsupervised methods. Then we perform diversity analysis on defective modules for these methods by using the McNemar test. Empirical results verify that different CPDP methods may lead to differences in the modules predicted as defective, especially when the comparison is performed between the supervised methods and unsupervised methods. Finally, we also find that there exists a certain number of defective modules that cannot be correctly identified by any of the CPDP methods or can be correctly identified by only one CPDP method. These findings can be utilized to design more effective methods to further improve the performance of CPDP.

ATOM: Commit Message Generation Based on Abstract Syntax Tree and Hybrid Ranking

Liu

Gao

Chen

et al. 2019

Preprint

SPI: Automated Identification of Security Patches via Commits

Zhou

Siow

Wang

et al. 2021

ACM Trans. Softw. Eng. Methodol.

Security patches in open source software, providing security fixes to identified vulnerabilities, are crucial in protecting against cyber attacks. Security advisories and announcements are often publicly released to inform users about potential security vulnerabilities. Although the National Vulnerability Database (NVD) publishes identified vulnerabilities, a vast majority of vulnerabilities and their corresponding security patches remain beyond public exposure, e.g., in the open source libraries that are heavily relied on by developers. As many of these patches exist in open sourced projects, the problem of curating and gathering security patches can be difficult due to their hidden nature. An extensive and complete security patches dataset could help end-users such as security companies, e.g., in building a security knowledge base, or researchers, e.g., in aiding vulnerability research. To efficiently curate security patches including undisclosed patches at large scale and low cost, we propose a deep neural-network-based approach built upon commits of open source repositories. First, we design and build security patch datasets that include 38,291 security-related commits and 1,045 Common Vulnerabilities and Exposures (CVE) patches from four large-scale C programming language libraries. We manually verify each commit, among the 38,291 security-related commits, to determine if they are security related. We devise and implement a deep learning-based security patch identification system that consists of two composite neural networks: one commit-message neural network that utilizes pretrained word representations learned from our commits dataset and one code-revision neural network that takes code before revision and after revision and learns the distinction on the statement level. Our system leverages the power of the two networks for Security Patch Identification. Evaluation results show that our system significantly outperforms SVM and K-fold stacking algorithms. The result on the combined dataset achieves as high as 87.93% F1-score and precision of 86.24%. We deployed our pipeline and learned model in an industrial production environment to evaluate the generalization ability of our approach. The industrial dataset consists of 298,917 commits from 410 new libraries that span a wide range of functionalities. Our experiment results and observation on the industrial dataset proved that our approach can identify security patches effectively among open sourced projects.

A unified framework to learn program semantics with graph neural networks

Liu

2020

Enhancing Security Patch Identification by Capturing Structures in Commits

Liu

Feng

et al. 2024

IEEE Trans. Dependable and Secure Comput.

