2015 IEEE/ACM 37th IEEE International Conference on Software Engineering
DOI: 10.1109/icse.2015.93

The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models

Abstract: The reliability of a prediction model depends on the quality of the data from which it was trained. Therefore, defect prediction models may be unreliable if they are trained using noisy data. Recent research suggests that randomly-injected noise that changes the classification (label) of software modules from defective to clean (and vice versa) can impact the performance of defect models. Yet, in reality, incorrectly labelled (i.e., mislabelled) issue reports are likely non-random. In this paper, we s…

Cited by 94 publications (64 citation statements)
References 43 publications
“…Kim et al [72] find that the randomly-generated noise has a large negative impact on the performance of defect models. On the other hand, our recent work [138] shows that realistic noise (i.e., noise generated by actually mislabelled issue reports [54]) does not typically impact the precision of defect prediction models.…”
Section: The Experimental Design of Defect Prediction Models
Mentioning confidence: 97%
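
The snippet above contrasts randomly-injected label noise with realistic noise from mislabelled issue reports. The sketch below illustrates the random-noise side of that comparison: flip a fraction of module labels at random, retrain, and compare precision. The dataset, column layout, and the 20% noise rate are illustrative assumptions, not the setup of the cited studies.

```python
# Minimal sketch of a random label-noise experiment (illustrative, not the
# cited studies' protocol): train on clean vs. randomly-flipped labels and
# compare precision on a held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

def inject_random_label_noise(y, noise_rate, rng):
    """Flip `noise_rate` of the binary labels uniformly at random."""
    y_noisy = y.copy()
    flip = rng.random(len(y)) < noise_rate
    y_noisy[flip] = 1 - y_noisy[flip]
    return y_noisy

rng = np.random.default_rng(42)
X = rng.random((500, 10))                    # stand-in module metrics
y = (rng.random(500) < 0.2).astype(int)      # stand-in defect labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clean_model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
noisy_model = RandomForestClassifier(random_state=0).fit(
    X_tr, inject_random_label_noise(y_tr, noise_rate=0.2, rng=rng))

print("precision (clean labels):", precision_score(y_te, clean_model.predict(X_te)))
print("precision (noisy labels):", precision_score(y_te, noisy_model.predict(X_te)))
```

A realistic-noise variant would replace the uniform flips with flips concentrated on the issue reports that are actually prone to mislabelling, which is the scenario the cited work [138] examines.
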
“…Jiang et al [62] and Bibi et al [13] use the default value of k for the k-nearest neighbours classification technique (k = 1). In our prior work (e.g., [39,138]), we ourselves have also used default classification settings.…”
Section: Related Work and Research Questions
Mentioning confidence: 99%
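
The snippet above concerns relying on a classifier's default settings (k = 1 for k-nearest neighbours) rather than tuning them. A small sketch of the difference follows; note that scikit-learn's default is n_neighbors=5, so the k = 1 default of the cited tools is passed explicitly here, and the data is synthetic and purely for illustration.

```python
# Default classifier settings vs. tuned settings (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Mirrors the k = 1 default used in the cited studies.
default_knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Tuning k instead of accepting the default, via cross-validated grid search.
tuned_knn = GridSearchCV(KNeighborsClassifier(),
                         param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
                         cv=5).fit(X, y)
print("selected k:", tuned_knn.best_params_["n_neighbors"])
```
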
“…The algorithm was applied to each system of the dataset and for each set of predictors considered (i.e., based on structural metrics [30], entropy of changes [5], number of developers [56], scattering metrics [32], [33], and antipattern metrics [27]), and for this reason we had to analyze 34 ranks for each basic model. Therefore, as suggested by previous work [122], [123], [124], we again adopted the Scott-Knott ESD test [90], which in this case aimed to find the statistically significant features composing the models.…”
Section: RQ3 - Gain Provided by the Intensity Index
Mentioning confidence: 99%
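
The snippet above uses the Scott-Knott ESD test to group treatments (here, features) into statistically distinct ranks. Below is a deliberately simplified sketch of the idea behind that ranking: order treatments by their score distributions and place two treatments in the same rank when the effect size between them is negligible (Cohen's d < 0.2). This is an illustrative approximation with made-up data, not the ScottKnottESD implementation the cited work relies on.

```python
# Simplified, illustrative approximation of an effect-size-aware ranking.
import numpy as np

def cohens_d(a, b):
    """Absolute Cohen's d between two score distributions."""
    pooled = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
    return 0.0 if pooled == 0 else abs(np.mean(a) - np.mean(b)) / pooled

def esd_like_ranks(distributions, negligible=0.2):
    """distributions: dict mapping treatment name -> array of scores."""
    ordered = sorted(distributions, key=lambda k: np.mean(distributions[k]), reverse=True)
    ranks, current = [], [ordered[0]]
    for name in ordered[1:]:
        # Merge into the current rank when the difference from its best
        # member is a negligible effect size; otherwise start a new rank.
        if cohens_d(distributions[current[0]], distributions[name]) < negligible:
            current.append(name)
        else:
            ranks.append(current)
            current = [name]
    ranks.append(current)
    return ranks

rng = np.random.default_rng(1)
scores = {"featureA": rng.normal(0.80, 0.02, 30),
          "featureB": rng.normal(0.70, 0.02, 30),
          "featureC": rng.normal(0.55, 0.02, 30)}
print(esd_like_ranks(scores))  # treatments whose scores are effectively tied share a rank
```
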
“…Moreover, AUC and Brier score are robust to data where the distribution of the dependent variable is skewed (Fawcett, 2006). Nonetheless, we also measure precision, recall, and F-measure, which are commonly used in the software engineering literature (Elish and Elish, 2008; Foo et al, 2015; Tantithamthavorn et al, 2015; Zimmermann et al, 2005). Below, we describe each of the performance measures:…”
Section: Model Analysis (MA)
Mentioning confidence: 99%
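
The snippet above lists the performance measures used for model analysis. The short sketch below computes each of them with scikit-learn on made-up predictions: the threshold-independent measures (AUC, Brier score) take predicted probabilities, while precision, recall, and F-measure use a 0.5 probability cutoff, which is an assumption for illustration rather than the cited paper's setup.

```python
# Computing the listed performance measures on illustrative predictions.
import numpy as np
from sklearn.metrics import (roc_auc_score, brier_score_loss,
                             precision_score, recall_score, f1_score)

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])           # made-up labels
y_prob = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.4, 0.9, 0.2, 0.7, 0.1])
y_pred = (y_prob >= 0.5).astype(int)                         # 0.5 cutoff

print("AUC:        ", roc_auc_score(y_true, y_prob))
print("Brier score:", brier_score_loss(y_true, y_prob))
print("Precision:  ", precision_score(y_true, y_pred))
print("Recall:     ", recall_score(y_true, y_pred))
print("F-measure:  ", f1_score(y_true, y_pred))
```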