2018
DOI: 10.1002/stvr.1658

Heterogeneous fault prediction with cost‐sensitive domain adaptation

Abstract: In the early phases of software testing, projects may have only limited historical defect data. Learning a prediction model with such insufficient training data limits the efficacy of the learned predictor. In practice, many fault prediction datasets are publicly available. Recently, heterogeneous fault prediction (HFP) has been proposed. However, existing HFP models do not investigate how to use mixed project data to predict the target. Furthermore, defect data are often imbalanced. The imbala…
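The abstract's two themes, cost sensitivity and class imbalance, combine in a standard way: weighting misclassification errors unequally shifts the decision threshold toward the rare (defective) class. The following is a minimal illustrative sketch of that cost-sensitive decision rule, not the paper's actual algorithm; the cost values are assumptions chosen for illustration.

```python
# Illustrative sketch (not the paper's method): a cost-sensitive decision rule.
# With imbalanced defect data, missing a defective module (false negative) is
# typically costlier than raising a false alarm; encoding that asymmetry moves
# the decision threshold away from the plain 0.5 cutoff.

def cost_sensitive_predict(p_defective, cost_fn=5.0, cost_fp=1.0):
    """Predict 'defective' when the expected cost of labeling the module
    'clean' exceeds the expected cost of labeling it 'defective'.

    p_defective : model-estimated probability that the module is defective
    cost_fn     : cost of missing a real defect (assumed value)
    cost_fp     : cost of a false alarm (assumed value)
    """
    expected_cost_if_clean = p_defective * cost_fn          # we say "clean", it is defective
    expected_cost_if_defective = (1 - p_defective) * cost_fp  # we say "defective", it is clean
    return expected_cost_if_clean > expected_cost_if_defective

# With a 5:1 cost ratio the effective threshold drops from 0.5 to 1/6,
# so minority-class (defective) modules are flagged more readily.
print(cost_sensitive_predict(0.2))  # True  (0.2 > 1/6)
print(cost_sensitive_predict(0.1))  # False (0.1 < 1/6)
```

With equal costs the rule reduces to the usual 0.5 threshold, which is why plain classifiers underserve the minority class on imbalanced defect data.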

Cited by 23 publications (10 citation statements)
References 75 publications
“…The within-project defect prediction model that detects defect-prone (DP) instances is constructed from historical defect data of the same project [30,46]. However, in practice, some projects lack a sufficient amount of such historical data [47,48], so a cross-project defect prediction approach is adopted to predict defects in a project via prediction models trained on historical defect data of other projects [11,[49][50][51]. This article is based on the within-project defect prediction setting.…”
Section: Related Work
confidence: 99%
“…It can be beneficial in guiding software testing effort and effective in the optimal allocation of testing resources [5][6][7][8]. Yet, one of the challenges of SDP is the class imbalance problem (CIP) [9][10][11]. CIP means that the number of non-defect-prone (NDP) artifacts is much larger than the number of artifacts that have defects.…”
Section: Introduction
confidence: 99%
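One common remedy for the CIP described in this excerpt is to rebalance the training set before learning. Below is a minimal sketch of random oversampling of the minority (defect-prone) class, one of several standard techniques; the data and function name are hypothetical, not taken from the cited study.

```python
import random

# Illustrative sketch of one standard CIP remedy: random oversampling of the
# minority (defect-prone) class so both classes contribute equally to training.

def oversample_minority(samples, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority samples until class counts match."""
    rng = random.Random(seed)
    minority = [s for s, l in zip(samples, labels) if l == minority_label]
    majority = [s for s, l in zip(samples, labels) if l != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced_samples = majority + minority + extra
    balanced_labels = [0] * len(majority) + [1] * (len(minority) + len(extra))
    return balanced_samples, balanced_labels

# 4 non-defective modules, 1 defective module (hypothetical feature vectors)
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
Xb, yb = oversample_minority(X, y)
print(sum(yb), len(yb) - sum(yb))  # 4 4 — classes now balanced
```

Cost-sensitive learning, as in the paper under discussion, attacks the same problem without duplicating data, by reweighting errors instead of samples.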
“…Li et al. proposed a new cost-sensitive transfer kernel canonical correlation analysis (CTKCCA) approach for HDP, which makes the data distributions of source and target projects much more similar in the nonlinear feature space [3]. Li et al. not only made better use of the two projects but also alleviated the class imbalance problem by setting different misclassification costs for different samples [4]. Li et al. proposed a multi-source selection based manifold discriminant alignment (MSMDA) approach.…”
Section: Introduction
confidence: 99%
“…Software defect prediction (SDP) is a hot research topic in the current software engineering research domain; it can help optimize test resource allocation by predicting defect‐prone modules (the granularity of a module can be set to file, class, or code change as needed) in advance. A number of defect prediction approaches have been proposed; these approaches mainly apply machine learning techniques to build prediction models by mining data stored in software historical repositories. These approaches typically use various features (ie, metrics) to measure modules extracted from repositories and then apply machine learning algorithms to predict whether a new module is defective or not.…”
Section: Introduction
confidence: 99%
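The workflow in the last excerpt (measure modules with metrics, train on historical data, classify new modules) can be sketched end to end. This is a hypothetical toy, with invented metric names and a nearest-centroid rule standing in for the machine-learning algorithm; real SDP studies use richer metric suites and learners.

```python
# Hypothetical minimal sketch of the SDP workflow: featurize modules,
# train on labeled historical data, predict labels for new modules.
# Metric names and values are invented for illustration.

def train_centroids(features, labels):
    """Return the per-class mean feature vector (centroid)."""
    centroids = {}
    for cls in set(labels):
        rows = [f for f, l in zip(features, labels) if l == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, module):
    """Label a new module by its nearest class centroid (squared distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda cls: sq_dist(centroids[cls], module))

# Features per module: [lines_of_code, cyclomatic_complexity] (illustrative)
X = [[120, 4], [80, 2], [400, 15], [350, 12]]
y = ["clean", "clean", "defective", "defective"]
model = train_centroids(X, y)
print(predict(model, [390, 14]))  # defective
```

The heterogeneous setting discussed throughout this page arises when the source and target projects do not share the same feature set, so the two feature spaces must first be aligned before any such classifier applies.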