Applying Machine Learning to Predict Software Fault Proneness Using Change Metrics, Static Code Metrics, and a Combination of Them

Alshehri, Yasser; Goševa-Popstojanova, Katerina; Dzielski, Dale; Devine, Thomas

doi:10.1109/secon.2018.8478911

Cited by 19 publications

(23 citation statements)

References 5 publications


“…Once obtained the labelled dataset, we have applied various ML techniques, previously selected among those that have already been used in the Software Engineering field, and, more in detail, in the software defect prediction problem, as presented in previous literature [25]: AdaBoost (AB) [26], Boosted Logistic Regression (BLR) [21,27], J48 [28], Cost-Sensitive C5.0 (C5.0 Cost) [29], Logistic Model Tree (LMT) [30], Multilayer Perceptron (MLP) [31], Support Vector Machines with Radial Basis Function Kernel (SVM Radial) [32], Partial Least Squares (PLS) [33], Boosted Tree (BT) [34] and Random Forest (RF) [35]. In order to compare the different ML techniques, we have employed the most common performance indicators detailed in literature.…”

Section: Methods (mentioning)

confidence: 99%

Lessons Learned from the Assessment of Software Defect Prediction on WLCG Software: A Study with Unlabelled Datasets and Machine Learning Techniques

et al. 2020


Software defect prediction is an activity that aims at narrowing down the most likely defect-prone software modules and helping developers and testers to prioritize inspection and testing. This activity can be addressed by using Machine Learning techniques applied to software metrics datasets that are usually unlabelled, i.e. they lack module classification in terms of defectiveness. To overcome this limitation, in addition to the usual data pre-processing operations to manage missing values and/or to remove inconsistencies, researchers have to adopt an approach to label their unlabelled software datasets. The extraction of defectiveness data to label all the instances of the datasets is an extremely time- and effort-consuming operation. In the literature, many studies have introduced approaches to build defect prediction models on unlabelled datasets. In this paper, we describe the analysis of new unlabelled datasets from WLCG software, coming from HEP-related experiments and middleware, by using Machine Learning techniques. We have experimented with new approaches to label the various modules due to the heterogeneity of software metrics distribution. We discuss a number of lessons learned from conducting these activities, what has worked, what has not and how our research can be improved.



“…Commonly used metrics can be generally divided into three categories: static code metrics, network metrics, and process metrics. Static code metrics measure the complexity of source code and assume that the more complex the source code is, the more likely defects are to appear [37]. Network metrics [38], which are effective for SDP, are social network analysis (SNA) metrics calculated based on the dependency graph of a software system and quantify the topological structure of each node of the dependency graph in a certain sense.…”

Section: Related Work, A. Software Defect Prediction (mentioning)

confidence: 99%
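The citation statement above describes network metrics as social network analysis measures computed per node of a software dependency graph. As a minimal sketch of that idea, the snippet below computes in-degree, out-degree, and normalised degree centrality for each module. The graph, the module names, and the function `degree_metrics` are hypothetical and chosen only for illustration; the cited work may use a different and richer set of SNA metrics.

```python
from collections import defaultdict


def degree_metrics(edges):
    """Return {node: (in_degree, out_degree, degree_centrality)}.

    `edges` is a list of (source, target) pairs of a directed graph
    without duplicate edges.
    """
    nodes = sorted({node for edge in edges for node in edge})
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    # Degree centrality: total degree divided by the number of other nodes.
    denom = max(len(nodes) - 1, 1)
    return {
        n: (in_deg[n], out_deg[n], (in_deg[n] + out_deg[n]) / denom)
        for n in nodes
    }


# Hypothetical dependency graph: an edge (a, b) means module a depends on b.
edges = [("ui", "core"), ("api", "core"), ("core", "utils"), ("api", "utils")]
metrics = degree_metrics(edges)
for module, (fan_in, fan_out, centrality) in metrics.items():
    print(f"{module}: in={fan_in} out={fan_out} centrality={centrality:.2f}")
```

In this toy graph, `core` is the most central module because two modules depend on it and it depends on one other. Per-module values of this kind are what would be fed to a defect predictor as features.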

Semantic Feature Learning via Dual Sequences for Defect Prediction

Lin

2021

IEEE Access


Software defect prediction (SDP) can help developers reasonably allocate limited resources for locating bugs and prioritizing their testing efforts. Existing methods often serialize an Abstract Syntax Tree (AST) obtained from the program source code into a token sequence, which is then inputted into the deep learning model to learn the semantic features. However, there are different ASTs with the same token sequence, and it is impossible to distinguish the tree structure of the ASTs only by a token sequence. To solve this problem, this paper proposes a framework called Semantic Feature Learning via Dual Sequences (SFLDS), which can capture the semantic and structural information in the AST for feature generation. Specifically, based on the AST, we select the representative nodes in the AST and convert the program source code into a simplified AST (S-AST). Our method introduces two sequences to represent the semantic and structural information of the S-AST, one is the result of traversing the S-AST node in pre-order, and another is composed of parent nodes. Then each token in the dual sequences is encoded as a numerical vector via mapping and word embedding. Finally, we use a bi-directional long short-term memory (BiLSTM) based neural network to automatically generate semantic features from the dual sequences for SDP. In addition, to leverage the statistical characteristics contained in the handcrafted metrics, we also propose a framework called Defect Prediction via SFLDS (DP-SFLDS) which combines the semantic features generated from SFLDS with handcrafted metrics to perform SDP. In our empirical studies, eight open-source Java projects from the PROMISE repository are chosen as our empirical subjects. 
Experimental results show that our proposed approach can perform better than several state-of-the-art baseline SDP methods.

Index Terms: Software defect prediction, abstract syntax tree, deep learning, bi-directional long short-term memory network.

Lu Lu received the Ph.D. degree from Xi'an Jiaotong University, in 1999. He is currently a Professor with the School of Computer Science and Engineering, South China University of Technology, China. His main research interests include software engineering, software testing, and software architecture design.
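The dual-sequence idea in the abstract above (one sequence from a pre-order traversal of the tree, a second sequence made of each node's parent) can be illustrated with a short sketch. This is not the SFLDS implementation: it uses Python's built-in `ast` module as a stand-in for the Java ASTs studied in the paper, it skips the paper's step of selecting representative nodes to build the simplified AST, and the function name `dual_sequences` and the `<root>` marker are assumptions made for the example.

```python
import ast


def dual_sequences(source):
    """Return (node_types, parent_types) from a pre-order walk of the AST.

    The two lists have the same length: parent_types[i] is the node type of
    the parent of the i-th visited node ("<root>" for the tree root).
    """
    tree = ast.parse(source)
    tokens, parents = [], []

    def visit(node, parent_name):
        # Pre-order: record the node first, then descend into its children.
        tokens.append(type(node).__name__)
        parents.append(parent_name)
        for child in ast.iter_child_nodes(node):
            visit(child, type(node).__name__)

    visit(tree, "<root>")
    return tokens, parents


tokens, parents = dual_sequences("x = 1")
for tok, par in zip(tokens, parents):
    print(f"{tok:10s} parent={par}")
```

Pairing every token with its parent is what lets two trees that flatten to the same token sequence still be told apart, which is the limitation of single-sequence serialization that the abstract points out.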


“…We provide all the classifiers the same data and we fixed the ratio to 0.3 for the training and testing sets. We use the confusion matrix shown in Table 4 to compute the performance of the classifiers [39]. This matrix provides the values of the following metrics such as accuracy, precision, recall, the F1-score expressed as follows:…”

Section: A. Fraud Detection and Risk Measurement (mentioning)

confidence: 99%
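The citation statement above breaks off before giving the formulas for the metrics it derives from the confusion matrix. The sketch below uses the standard textbook definitions of accuracy, precision, recall, and F1-score for a binary classifier. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the cited paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard metrics from binary confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Returning 0.0 for undefined ratios is one common convention; some tools raise a warning or return NaN instead, so the choice should match whatever evaluation protocol is being reproduced.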

A Secure AI-Driven Architecture for Automated Insurance Systems: Fraud Detection and Risk Measurement

et al. 2020


The private insurance sector is recognized as one of the fastest-growing industries. This rapid growth has fueled incredible transformations over the past decade. Nowadays, there exist insurance products for most high-value assets such as vehicles, jewellery, health/life, and homes. Insurance companies are at the forefront in adopting cutting-edge operations, processes, and mathematical models to maximize profit whilst servicing their customers' claims. Traditional methods that are exclusively based on human-in-the-loop models are very time-consuming and inaccurate. In this paper, we develop a secure and automated insurance system framework that reduces human interaction, secures the insurance activities, alerts and informs about risky customers, detects fraudulent claims, and reduces monetary loss for the insurance sector. After presenting the blockchain-based framework to enable secure transactions and data sharing among different interacting agents within the insurance network, we propose to employ the extreme gradient boosting (XGBoost) machine learning algorithm for the aforementioned insurance services and compare its performance with that of other state-of-the-art algorithms. The obtained results reveal that, when applied to an auto insurance dataset, XGBoost achieves high performance gains compared to other existing learning algorithms. For instance, it reaches 7% higher accuracy compared to decision tree models when detecting fraudulent claims. Furthermore, we propose an online learning solution to automatically deal with real-time updates of the insurance network and we show that it outperforms another online state-of-the-art algorithm.
Finally, we combine the developed machine learning modules with the Hyperledger Fabric Composer to implement and emulate the artificial intelligence and blockchain-based framework.
