Choosing software metrics for defect prediction: an investigation on feature selection techniques

Gao, Kehan; Khoshgoftaar, Taghi M.; Wang, Huanjing; Seliya, Naeem

doi:10.1002/spe.1043

Cited by 237 publications

(156 citation statements)

References 33 publications

Supporting

Mentioning

150

Contrasting

Unclassified


“…So far, they have been widely used to estimate the defect-proneness of software components, and more details of these approaches can refer to the recent surveys [3,4]. On the other hand, considering a large number of software metrics, feature subset selection and dimensionality reduction techniques have also been applied to these new defect prediction methods [22,23], and many empirical studies have demonstrated that they are able to achieve higher accuracy and computing efficiency by removing redundant and irrelevant software metrics [10].…”

Section: Related Work (mentioning)

confidence: 99%

An empirical study on predicting defect numbers

Chen

2015

International Conferences on Software Engineering and Knowledge Engineering


Abstract: Defect prediction is an important activity to make software testing processes more targeted and efficient. Many methods have been proposed to predict the defect-proneness of software components using supervised classification techniques in within- and cross-project scenarios. However, very few prior studies address the above issue from the perspective of predictive analytics. How to make an appropriate decision among different prediction approaches in a given scenario remains unclear. In this paper, we empirically investigate the feasibility of defect numbers prediction with typical regression models in different scenarios. The experiments on six open-source software projects in PROMISE repository show that the prediction model built with Decision Tree Regression seems to be the best estimator in both of the scenarios, and that for all the prediction models, the results yielded in the cross-project scenario can be comparable to (or sometimes better than) those in the within-project scenario when choosing suitable training data. Therefore, the findings provide a useful insight into defect numbers prediction for those new and inactive projects.


“…The eight methods used in this work are Chi-Square (CS), Correlation (Cor), Information Gain (IG), Symmetrical Uncertainty (SU), Fisher Score (FS), Welch T-Statistic (WTS), ReliefF (RF), One Rule (OneR). The reason why we choose these methods is that they are widely used in defect prediction and belong to different feature selection families [28], [30]. CS is a statistic-based method, Cor is a correlation-based method, IG and SU are entropy-based methods, FS and WTS are first order statistics-based methods, RF is a instance-based, OneR is a classifier-based method.…”

Section: A Feature Ranking Methods (mentioning)

confidence: 99%
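The entropy-based rankers named in the statement above (IG and SU) can be illustrated with a minimal sketch. It assumes the metrics have already been discretised into bins; the feature names and toy data are hypothetical and are not taken from the cited study.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for a discrete feature X and class labels Y."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - conditional

def symmetrical_uncertainty(feature, labels):
    """SU = 2 * IG / (H(X) + H(Y)), which normalises IG into [0, 1]."""
    denom = entropy(feature) + entropy(labels)
    return 0.0 if denom == 0 else 2.0 * information_gain(feature, labels) / denom

def rank_features(columns, labels, score=information_gain):
    """Return feature names sorted from highest to lowest score."""
    return sorted(columns, key=lambda name: score(columns[name], labels), reverse=True)

# Hypothetical toy data: three binned metrics for six modules.
columns = {
    "loc_bin":        [0, 0, 0, 1, 1, 1],  # separates the two classes exactly
    "complexity_bin": [0, 0, 1, 1, 1, 0],  # partially informative
    "comment_bin":    [0, 1, 0, 0, 1, 0],  # carries no class information
}
labels = [0, 0, 0, 1, 1, 1]  # 1 = defective module

ranking = rank_features(columns, labels)
su_loc = symmetrical_uncertainty(columns["loc_bin"], labels)
```

Passing `score=symmetrical_uncertainty` to `rank_features` switches the ranking criterion without changing anything else, which is the sense in which these filters are interchangeable scoring functions.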

“…Various methods have been DOI reference number: 10.18293/SEKE2017-097 successfully introduced to assist the selection of a feature subset that could benefit the defect prediction process on SDD. Previous studies have shown that diverse feature selection methods yield quite different performance on prediction models for SDD [6], [7], which implies that different methods might be not equivalent, that is, different methods would identify different set of features as relevant. However, to the best of our knowledge, no previous studies proposed a method to investigate the equivalence of different feature selection methods.…”

Section: Introductionmentioning

confidence: 99%

An Empirical Study on the Equivalence and Stability of Feature Selection for Noisy Software Defect Data

Zhou, Liu, Xia et al.

2017

International Conferences on Software Engineering and Knowledge Engineering


Abstract: Software Defect Data (SDD) are used to build defect prediction models for software quality assurance. Existing work employs feature selection to eliminate irrelevant features in the data to improve prediction performance. Previous studies have shown that different feature selection methods do not always yield similar prediction performance on SDD, which indicates that these methods are not equivalent. Also, previous studies have shown that SDD usually contains noise that may interfere with the process of feature selection. In this work, we empirically investigate and measure the equivalence of different feature selection methods for SDD. Further, we intend to analyze the stability of the methods for noisy SDD. We perform statistical analyses on eight projects from NASA dataset with eight feature selection methods. For the equivalence analysis, we introduce Principal Component Analysis (PCA) and overlap index to qualitatively and quantitatively analyze the equivalence of these methods respectively. For the stability analysis, we apply consistency index to measure the stability of these methods. Experimental results indicate that different feature selection methods are indeed not equivalent to each other, and Correlation and Fisher Score methods achieve better stability.
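The abstract names an overlap index and a consistency index without defining them. As a rough sketch only, the code below assumes the overlap index is the Jaccard overlap of two selected-feature sets and the consistency index is Kuncheva's chance-corrected index for equal-sized subsets; the study may define either measure differently, and the metric names are hypothetical.

```python
def overlap_index(a, b):
    """Jaccard overlap: shared features divided by all features in either set.

    Assumption: the cited study may use a different overlap definition.
    """
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def consistency_index(a, b, n_features):
    """Kuncheva's consistency index for two subsets of equal size k.

    I_C = (r * n - k^2) / (k * (n - k)), where r is the number of shared
    features and n the total number of features. It is 1 for identical
    subsets and close to 0 when the overlap is what chance would give.
    """
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k:
        raise ValueError("both subsets must have the same size")
    if not 0 < k < n_features:
        raise ValueError("subset size must satisfy 0 < k < n_features")
    r = len(a & b)
    return (r * n_features - k * k) / (k * (n_features - k))

# Hypothetical subsets chosen by two rankers from 10 candidate metrics.
picked_by_ig = {"loc", "cyclomatic", "fan_out"}
picked_by_cs = {"loc", "cyclomatic", "comments"}

overlap = overlap_index(picked_by_ig, picked_by_cs)
consistency = consistency_index(picked_by_ig, picked_by_cs, n_features=10)
```

Under these assumptions, stability on noisy data could be estimated by comparing the subset selected from clean data with the subset selected after noise injection and averaging the consistency index over repetitions.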


“…Gao et al [19] studied four different filter-based feature selection methods with five different classifiers on a large telecommunication system and found that the Kolmogorov-Smirnov method performed the best. Gao et al [20] presented a comparative investigation to evaluate their proposed hybrid feature selection method, which first uses feature ranking to reduce the search space and then applies feature subset selection. In order to investigate different feature selection methods to classification-based bug prediction, Shivaji et al [21] utilized six feature selection methods to iteratively remove irrelevant features until achieving the best performance of F-measure.…”

Section: B Feature Selection In Defect Prediction (mentioning)

confidence: 99%

FSCR:A Feature Selection Method for Software Defect Prediction

Ma, Ma et al.

2017

International Conferences on Software Engineering and Knowledge Engineering


Abstract: Predicting the number of faults in software modules can be more helpful than predicting whether the modules are faulty or non-faulty. Some regression models have been used for predicting the number of faults. However, the software defect data may involve irrelevant and redundant module features, which will degrade the performance of these regression models. To address this issue, this paper proposes a feature selection method based on Feature Spectral Clustering and feature Ranking (FSCR) for predicting the number of software faults. First, FSCR groups the original features with spectral clustering according to the correlation between every two features. Second, FSCR employs the ReliefF algorithm to compute the relevance of each feature with respect to the number of faults and selects the top p most relevant features from each resulting cluster. We evaluate our proposed method on 6 widely-studied project datasets with four performance metrics. Comparison with five existing feature selection methods demonstrates that FSCR is effective in selecting features for predicting the number of faults.
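The two-stage shape of the method described in this abstract (group correlated features, then keep the top p per group) can be sketched as follows. This is not FSCR itself: to stay dependency-free, the sketch swaps spectral clustering for a greedy correlation-threshold grouping and swaps ReliefF for absolute Pearson correlation with the fault count. The threshold, function names, and data are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def group_features(columns, threshold=0.8):
    """Stand-in for spectral clustering: a feature joins the first group
    whose seed feature it correlates with at or above the threshold."""
    groups = []
    for name in columns:
        for group in groups:
            if abs(pearson(columns[name], columns[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

def select_top_p(columns, faults, p=1, threshold=0.8):
    """Stand-in for ReliefF ranking: keep the p features per group that
    correlate most strongly with the number of faults."""
    selected = []
    for group in group_features(columns, threshold):
        ranked = sorted(group, key=lambda f: abs(pearson(columns[f], faults)),
                        reverse=True)
        selected.extend(ranked[:p])
    return selected

# Hypothetical metrics for five modules and their fault counts.
columns = {
    "loc":        [10, 20, 30, 40, 50],
    "statements": [12, 19, 33, 41, 48],  # nearly redundant with loc
    "fan_out":    [3, 1, 4, 1, 5],
}
faults = [0, 1, 1, 2, 3]

groups = group_features(columns)
selected = select_top_p(columns, faults, p=1)
```

Selecting per group rather than globally is what removes redundancy: two near-duplicate metrics compete within one group instead of both reaching the final subset.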
