First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007)
DOI: 10.1109/esem.2007.28

The Effects of Over and Under Sampling on Fault-prone Module Detection

Abstract: The goal of this paper is to improve the prediction performance of fault-prone module prediction models (fault-proneness models) by employing over/under sampling methods, which are preprocessing procedures applied to a fit dataset. The sampling methods are expected to improve prediction performance when the fit dataset is imbalanced, i.e., when there is a large difference between the number of fault-prone modules and not-fault-prone modules. So far, there has been no research reporting the effects of applying sampling …
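
As a rough illustration of the preprocessing the abstract describes (a minimal sketch, not the authors' implementation; the function and variable names are hypothetical), random over- and under-sampling of an imbalanced fit dataset can be done with the standard library alone:

```python
import random

def over_sample(majority, minority, seed=0):
    """Random over-sampling: duplicate randomly chosen minority-class
    modules until both classes are the same size."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

def under_sample(majority, minority, seed=0):
    """Random under-sampling: keep a random subset of majority-class
    modules the same size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, k=len(minority)) + minority

# Example: a fit dataset with 90 not-fault-prone and 10 fault-prone modules.
not_fp = [("module%d" % i, False) for i in range(90)]
fp = [("module%d" % i, True) for i in range(10)]
print(len(over_sample(not_fp, fp)))   # 180 modules (90 + 90)
print(len(under_sample(not_fp, fp)))  # 20 modules (10 + 10)
```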

Cited by 119 publications (77 citation statements)
References 10 publications

“…That is, while sub-sampling offers no improvement over un-sampled Bayesian learning, under-sampling does not harm classifier performance. (Footnote: due to differences in experimental methods, we find we cannot compare our results to the regression tree and LDA analysis of [21].) This last point is the most significant.…”
Section: Experiments #1: Over- and Under-sampling (mentioning)
Confidence: 39%

“…The latter technique is a preprocessing procedure for balancing datasets with a large difference between the number of faulty and non-faulty classes, which has been found to cause performance degradation in fault-proneness models [37]. As we can observe in Tables 5 and 6, the percentages of specificity and sensitivity, and consequently correctness, were generally improved by normalization; nonetheless, specificity and sensitivity remain unbalanced in some cases.…”
Section: Discussion (mentioning)
Confidence: 58%
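
For context on the metrics quoted above (an illustrative sketch, not taken from the cited study; the example confusion-matrix counts are made up), sensitivity and specificity of a fault-proneness model come straight from the confusion matrix:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = recall on fault-prone modules;
    specificity = recall on not-fault-prone modules."""
    return tp / (tp + fn), tn / (tn + fp)

# On an imbalanced test set a model can look accurate while missing
# most faults: high specificity but poor sensitivity.
sens, spec = sensitivity_specificity(tp=8, fn=12, tn=85, fp=5)
print(round(sens, 2), round(spec, 2))  # 0.4 0.94
```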

“…of defects, fault distributions in each module, and segregation of defects among modules [39].
• There is a shortage of business knowledge in data mining algorithms, which causes serious performance issues when the required information relating software metrics to defect frequencies cannot be retrieved [40].
• Generally low performance of fault-forecasting models is due to imbalance in the training datasets [41].…”
Section: A) Extremely Skewed and Unbalanced Datasets (mentioning)
Confidence: 99%