First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007)
DOI: 10.1109/esem.2007.28

The Effects of Over and Under Sampling on Fault-prone Module Detection

Abstract: The goal of this paper is to improve the prediction performance of fault-prone module prediction models (fault-proneness models) by employing over/under sampling methods, which are preprocessing procedures applied to a fit dataset. The sampling methods are expected to improve prediction performance when the fit dataset is imbalanced, i.e., when there is a large difference between the number of fault-prone modules and not-fault-prone modules. So far, there has been no research reporting the effects of applying sampling …
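
As a rough illustration of the preprocessing the abstract describes (a minimal sketch, not the authors' implementation; the function and variable names are hypothetical), random over- and under-sampling of an imbalanced fit dataset can be done with the standard library alone:

```python
import random

def over_sample(majority, minority, seed=0):
    """Random over-sampling: duplicate randomly chosen minority-class
    modules until both classes are the same size."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

def under_sample(majority, minority, seed=0):
    """Random under-sampling: keep a random subset of majority-class
    modules the same size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, k=len(minority)) + minority

# Example: a fit dataset with 90 not-fault-prone and 10 fault-prone modules.
not_fp = [("module%d" % i, False) for i in range(90)]
fp = [("module%d" % i, True) for i in range(10)]
print(len(over_sample(not_fp, fp)))   # 180 modules (90 + 90)
print(len(under_sample(not_fp, fp)))  # 20 modules (10 + 10)
```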

Cited by 119 publications (77 citation statements)
References 10 publications

“…That is, while sub-sampling offers no improvement over un-sampled Bayesian learning, under-sampling does not harm classifier performance. (Footnote: due to differences in experimental methods, we find we cannot compare our results to the regression tree and LDA analysis of [21].) This last point is the most significant.…”
Section: Experiments #1: Over- and Under-sampling (mentioning)
Confidence: 39%

“…The latter technique is a preprocessing procedure for balancing datasets with a large difference between the number of faulty and non-faulty classes, which has been found to cause performance degradation in fault-proneness models [37]. As we can observe in Tables 5 and 6, the percentages of specificity and sensitivity, and consequently correctness, were generally improved by normalization; nonetheless, specificity and sensitivity remain unbalanced in some cases.…”
Section: Discussion (mentioning)
Confidence: 58%
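
For context on the metrics quoted above (an illustrative sketch, not taken from the cited study; the example confusion-matrix counts are made up), sensitivity and specificity of a fault-proneness model come straight from the confusion matrix:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = recall on fault-prone modules;
    specificity = recall on not-fault-prone modules."""
    return tp / (tp + fn), tn / (tn + fp)

# On an imbalanced test set a model can look accurate while missing
# most faults: high specificity but poor sensitivity.
sens, spec = sensitivity_specificity(tp=8, fn=12, tn=85, fp=5)
print(round(sens, 2), round(spec, 2))  # 0.4 0.94
```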

“…of defects, fault distributions in each module, and segregation of defects among modules [39].
• There is a shortage of business knowledge in data mining algorithms, which causes serious performance issues when the required information relating software metrics to defect frequencies cannot be retrieved [40].
• Generally low performance of fault-forecasting models is due to imbalance in the training datasets [41].…”
Section: A) Extremely Skewed and Unbalanced Datasets (mentioning)
Confidence: 99%