2020
DOI: 10.1007/s11219-020-09515-0

A public unified bug dataset for Java and its assessment regarding metrics and bug prediction

Abstract: Bug datasets have been created and used by many researchers to build and validate novel bug prediction models. In this work, our aim is to collect existing public source code metric-based bug datasets and unify their contents. Furthermore, we wish to assess the plethora of collected metrics and the capabilities of the unified bug dataset in bug prediction. We considered 5 public datasets and we downloaded the corresponding source code for each system in the datasets and performed source code analysis to obtain…

Cited by 37 publications (34 citation statements)
References 49 publications
“…We show that our PLS-DA based class level prediction model achieves superior performance compared to the state-of-the-art approaches (i.e. F-measure of 0.44-0.47 at 90% confidence level) when no data re-sampling applied and comparable to others when up-sampling is applied on the largest open bug dataset we know [16,15,17], while training the model is significantly faster, thus finding optimal parameters is much easier. In terms of completeness, which measures the amount of bugs contained in the Java Classes predicted to be defective, PLS-DA outperforms every other algorithm: it found 69.3% and 79.4% of the total bugs with no re-sampling and up-sampling, respectively.…”
Section: Introduction
confidence: 84%
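The quoted statement defines "completeness" as the share of all known bugs that fall into the classes a model flags as defective. A minimal sketch of that measure, with made-up class names and bug counts (not taken from the dataset):

```python
def completeness(bug_counts, predicted_defective):
    """Share of all bugs contained in classes predicted to be defective.

    bug_counts: dict mapping class name -> number of known bugs
    predicted_defective: set of class names the model flagged
    """
    total = sum(bug_counts.values())
    if total == 0:
        return 0.0
    found = sum(n for cls, n in bug_counts.items() if cls in predicted_defective)
    return found / total

# Toy example (hypothetical values):
bugs = {"A.java": 3, "B.java": 0, "C.java": 2, "D.java": 5}
flagged = {"A.java", "D.java"}
print(completeness(bugs, flagged))  # 8 of 10 bugs found -> 0.8
```

Unlike precision or F-measure, this metric weights each flagged class by how many bugs it actually contains, which is why a model can trade some precision for higher completeness.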
“…For creating, optimizing, and evaluating our statistical model, we used the Public Unified Bug Dataset for Java [16,15,17]. It contains the data entries of 5 different public bug datasets (PROMISE [45], Eclipse Bug Dataset [56], Bug Prediction Dataset [11], Bugcatchers Bug Dataset [24], and GitHub Bug Dataset [50]) in a unified manner.…”
Section: Dataset and Predictors
confidence: 99%
“…In order to be able to predict errors in software with different ML techniques, we need a dataset of the right size and quality. The Unified Bug Dataset [Ferenc et al, 2020b] is suitable for this purpose. This dataset merges several datasets, which are the GitHub Bug Dataset [Tóth et al, 2016], the Promise [Jureczko and Madeyski, 2010] dataset, and the Bug Prediction Dataset [D'Ambros et al, 2010].…”
Section: Datasets
confidence: 99%
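The Unified Bug Dataset merges entries from several source datasets into one shared schema. The kind of unification step involved can be sketched as follows; the field names here are hypothetical and the real dataset's schema differs:

```python
# Hypothetical records as they might appear in two source datasets
# with differing column conventions.
promise_rows = [
    {"name": "org.foo.Bar", "wmc": 12, "bug": 1},
]
github_rows = [
    {"ClassName": "org.baz.Qux", "WMC": 7, "NumberOfBugs": 0},
]

def unify(rows, name_key, wmc_key, bug_key, origin):
    """Map heterogeneous column names onto one shared schema,
    recording which source dataset each entry came from."""
    return [
        {
            "class": r[name_key],
            "WMC": r[wmc_key],
            "buggy": int(r[bug_key] > 0),
            "source": origin,
        }
        for r in rows
    ]

unified = (
    unify(promise_rows, "name", "wmc", "bug", "PROMISE")
    + unify(github_rows, "ClassName", "WMC", "NumberOfBugs", "GitHub")
)
print(len(unified))  # 2
```

Keeping a `source` field per entry is what lets later experiments train on one constituent dataset and test on another.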
“…We discuss the details of our methodology in Section 3. Finally, we validated the effectiveness of this source code representation to predict bugs on the Unified Bug Dataset [Ferenc et al, 2020b], which is a dataset of buggy and non-buggy classes implemented in Java. We report our results in Section 5, where we also sought answers to our research questions regarding this representation: RQ1 Is there a Doc2Vec parametrization that would produce similar or better results than learning based on code metrics?…”
Section: Introduction
confidence: 99%
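RQ1 asks whether a Doc2Vec embedding of source code can match metric-based features. The quoted work presumably uses a trained Doc2Vec model (e.g., gensim's); as a dependency-free stand-in, the toy hashing vectorizer below only illustrates the general idea of mapping code text to a fixed-length vector, and is far cruder than Doc2Vec:

```python
def hash_vectorize(code, dim=8):
    """Toy fixed-length embedding of a code snippet via the hashing trick.

    A crude stand-in for Doc2Vec: each token increments one of `dim`
    buckets chosen by a deterministic hash of the token, and the
    result is normalized to sum to 1.
    """
    vec = [0.0] * dim
    tokens = code.replace("(", " ").replace(")", " ").split()
    for tok in tokens:
        # Python's built-in hash() is salted per process for strings,
        # so use a stable character-sum hash instead.
        bucket = sum(ord(c) for c in tok) % dim
        vec[bucket] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]

v = hash_vectorize("public void run ( )")
print(len(v), round(sum(v), 6))  # 8 1.0
```

The essential property shared with Doc2Vec is that arbitrary-length code maps to a fixed-length numeric vector that a downstream classifier can consume alongside, or instead of, code metrics.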
“…SmartSHARK in conjunction with a HPC-Cluster provided us with the means to extract this information for each file in each commit of our candidate projects. OpenStaticAnalyzer is an open-sourced version of the commercial tool SourceMeter (FrontEndART 2019) which has been used in multiple studies, e.g., Faragó et al (2015), Szóke et al (2014), and Ferenc et al (2014) and, more recently (Ferenc et al 2020). It works by constructing an Abstract Semantic Graph (ASG) from the source code which is then used to calculate static source code metrics.…”
Section: Metric Extraction
confidence: 99%
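OpenStaticAnalyzer computes metrics by walking a semantic graph built from Java source. The same idea in miniature can be shown with Python's own `ast` module on Python code; this is an analogue for illustration, not the tool's actual pipeline:

```python
import ast

def method_counts(source):
    """Count methods per class by walking the syntax tree,
    a miniature analogue of computing metrics over an ASG."""
    tree = ast.parse(source)
    counts = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            counts[node.name] = sum(
                isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef))
                for child in node.body
            )
    return counts

src = """
class Foo:
    def a(self): pass
    def b(self): pass

class Bar:
    x = 1
"""
print(method_counts(src))  # {'Foo': 2, 'Bar': 0}
```

Real metric suites (WMC, CBO, LCOM, etc.) follow the same pattern: parse once into a structured representation, then aggregate counts and relationships over its nodes.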