Cross-project defect prediction (CPDP) is a practical solution that allows software defect prediction (SDP) to be used earlier in the software lifecycle. With CPDP, a software defect predictor trained on labeled data from mature projects can be applied to the prediction task of a new project. Most previous CPDP approaches ignored the semantic information in the source code, and existing semantic-feature-based SDP methods do not take into account the data distribution divergence between projects. These limitations may weaken defect prediction performance. To solve these problems, we propose a novel approach, the transfer convolutional neural network (TCNN), to mine transferable semantic (deep-learning (DL)-generated) features for CPDP tasks. Specifically, our approach first parses each source file into an integer vector that serves as the network input. Next, to obtain the TCNN model, a matching layer is added to a convolutional neural network, in which the hidden representations of the source and target project-specific data are embedded into a reproducing kernel Hilbert space for distribution matching. By simultaneously minimizing the classification error and the distribution divergence between projects, the constructed TCNN can extract transferable DL-generated features. Finally, to avoid losing the information contained in handcrafted features, we combine them with the transferable DL-generated features to form joint features for CPDP. Experiments based on 10 benchmark projects (with 90 pairs of CPDP tasks) showed that the proposed TCNN method is superior to the reference methods.
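The idea of jointly minimizing classification error and distribution divergence can be illustrated with a small sketch. The abstract does not name the divergence measure, so this example assumes the squared maximum mean discrepancy (MMD) with a single Gaussian kernel, a common choice for matching distributions in a reproducing kernel Hilbert space. The function names, the `sigma` bandwidth, and the `trade_off` weight are all hypothetical and are not taken from the paper. A real model would compute this on network activations inside a deep learning framework; plain Python lists stand in for them here.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel value between two equal-length vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return (mean_kernel(source, source, sigma)
            + mean_kernel(target, target, sigma)
            - 2.0 * mean_kernel(source, target, sigma))


def joint_loss(classification_error, source_hidden, target_hidden, trade_off=1.0):
    """Classification error plus a weighted distribution-divergence penalty.

    trade_off is an assumed hyperparameter balancing the two terms.
    """
    return classification_error + trade_off * mmd2(source_hidden, target_hidden)


if __name__ == "__main__":
    # Toy hidden representations for a source and a target project.
    source_hidden = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    target_hidden = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
    print(joint_loss(0.5, source_hidden, target_hidden))
```

When the two samples coincide the penalty vanishes and the loss reduces to the classification error alone; the further apart the hidden representations drift, the larger the penalty becomes.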
Software quality plays an important role in the software lifecycle. Traditional software defect prediction approaches mainly focused on using hand-crafted features to detect defects. However, like human languages, programming languages contain rich semantic and structural information, and the cause of defective code is closely related to its context. Because traditional approaches fail to capture this information, their performance is far from satisfactory. In this study, the authors leveraged a long short-term memory (LSTM) network to automatically learn semantic and contextual features from the source code. Specifically, they first extract the program's abstract syntax trees (ASTs), which are made up of AST nodes, and then evaluate what and how much information can be preserved for several node types. They traverse the AST of each file and feed its nodes into the LSTM network to automatically learn the semantic and contextual features of the program, which are then used to determine whether the file is defective. Experimental results on several open-source projects showed that the proposed LSTM method is superior to the state-of-the-art methods.
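The preprocessing step described above, traversing an AST and turning its nodes into a sequence a network can consume, can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard `ast` module as a stand-in parser, keeps every node type rather than the selected node types the study evaluates, and uses a depth-first pre-order traversal, which the abstract does not specify. The helper names and the padding convention (index 0) are hypothetical, and the LSTM itself is omitted.

```python
import ast


def ast_node_sequence(source_code):
    """Return AST node type names in depth-first, pre-order traversal."""
    sequence = []

    def visit(node):
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source_code))
    return sequence


def build_vocabulary(sequences):
    """Map each distinct node type to a positive integer (0 is kept for padding)."""
    vocabulary = {}
    for sequence in sequences:
        for token in sequence:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary


def encode(sequence, vocabulary, length):
    """Convert a node sequence to a fixed-length list of integer ids."""
    ids = [vocabulary[token] for token in sequence][:length]
    return ids + [0] * (length - len(ids))


if __name__ == "__main__":
    code = "def f(x):\n    return x + 1\n"
    nodes = ast_node_sequence(code)
    vocab = build_vocabulary([nodes])
    print(nodes)
    print(encode(nodes, vocab, 16))
```

The fixed-length integer vectors produced by `encode` are the kind of input an embedding layer followed by a recurrent network would accept. The vocabulary here is built from the same files that are encoded, so unseen node types are not handled.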
Cross-project defect prediction (CPDP) is a feasible way to perform software defect prediction (SDP) when historical data are lacking. Recent CPDP approaches have employed deep learning techniques to better exploit the information in a program's abstract syntax trees (ASTs). However, the granularity of the AST nodes and the data distribution difference between projects may have negative impacts on prediction performance, which many CPDP studies did not take into consideration. To handle these issues, this paper explores a better AST node granularity and proposes a CPDP framework based on multi-kernel transfer convolutional neural networks. Specifically, for AST node granularity, we explore the differences among three AST node granularities and then compare the prediction performance of each granularity on several prediction models. For the CPDP framework, we first parse the program source code into ASTs and then encode the AST nodes into numerical vectors using an embedding technique. Second, to mine transferable semantic features, the encoded ASTs are fed into a convolutional neural network, in which a multi-kernel matching layer is added to minimize the data distribution divergence between the source and target projects. Finally, to make use of the information in the handcrafted features, the semantic features mined from the ASTs are combined with the handcrafted features to form the joint features for CPDP. We evaluate our approach on 110 CPDP tasks formed by 11 open-source projects, and the results show that the proposed CPDP method outperforms most deep learning-based approaches.
Index Terms: Abstract syntax tree, cross-project defect prediction, maximum mean discrepancy, multikernel, transfer learning.
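A multi-kernel matching term and the final feature combination might look like the sketch below. It assumes the matching layer uses the maximum mean discrepancy named in the index terms, computed under an equally weighted mixture of Gaussian kernels. The bandwidth values, the equal weights, and all function names are hypothetical choices for illustration, not details taken from the paper.

```python
import math


def multi_kernel(x, y, bandwidths, weights):
    """Weighted combination of Gaussian kernels with different bandwidths."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return sum(w * math.exp(-sq_dist / (2.0 * s ** 2))
               for w, s in zip(weights, bandwidths))


def mk_mmd2(source, target, bandwidths=(0.5, 1.0, 2.0), weights=None):
    """Biased squared MMD estimate under a mixture of Gaussian kernels."""
    if weights is None:
        # Assumed default: every kernel contributes equally.
        weights = [1.0 / len(bandwidths)] * len(bandwidths)

    def mean_k(xs, ys):
        total = sum(multi_kernel(x, y, bandwidths, weights)
                    for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return mean_k(source, source) + mean_k(target, target) - 2.0 * mean_k(source, target)


def joint_features(semantic, handcrafted):
    """Concatenate learned semantic features with handcrafted metrics."""
    return list(semantic) + list(handcrafted)


if __name__ == "__main__":
    source_hidden = [[0.0, 0.0], [1.0, 0.0]]
    target_hidden = [[4.0, 4.0], [5.0, 4.0]]
    print(mk_mmd2(source_hidden, target_hidden))
    print(joint_features([0.1, 0.2], [3, 7]))
```

Mixing several bandwidths avoids committing to a single kernel scale, which is the usual motivation for a multi-kernel variant over a single-kernel one.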