Fabian Trautsch scite author profile

Context: The SZZ algorithm is the de facto standard for labeling bug-fixing commits and finding inducing changes for defect prediction data. Recent research uncovered potential problems in different parts of the SZZ algorithm. Most defect prediction data sets provide only static code metrics as features, while research indicates that other features are also important. Objective: We provide an empirical analysis of the defect labels created with the SZZ algorithm and the impact of commonly used features on results. Method: We used a combination of manual validation and adopted or improved heuristics for the collection of defect data. We conducted an empirical study on 398 releases of 38 Apache projects. Results: We found that only half of the bug-fixing commits determined by SZZ are actually bug fixing. If a six-month time frame is used in combination with SZZ to determine which bugs affect a release, one file is incorrectly labeled as defective for every file that is correctly labeled as defective. In addition, two defective files are missed. We also explored the impact of the relatively small set of features that are available in most defect prediction data sets, as multiple publications indicate that, e.g., churn-related features are important for defect prediction. We found that the difference from using more features is not significant. Conclusion: Problems with inaccurate defect labels are a severe threat to the validity of the state of the art of defect prediction. Small feature sets seem to be a less severe threat.

Addressing problems with replicability and validity of repository mining studies through a smart data platform

Trautsch

Herbold

Makedonski

et al. 2017

Empir Software Eng

On the feasibility of automated prediction of bug and non-bug issues

Herbold

Trautsch

2020

Empir Software Eng

Context: Issue tracking systems are used to track and describe tasks in the development process, e.g., requested feature improvements or reported bugs. However, past research has shown that the reported issue types often do not match the description of the issue. Objective: We want to understand the overall maturity of the state of the art of issue type prediction with the goal of predicting whether issues are bugs, and to evaluate whether we can improve existing models by incorporating manually specified knowledge about issues. Method: We train different models for the title and description of the issue to account for the difference in structure between these fields, e.g., the length. Moreover, we manually detect issues whose description contains a null pointer exception, as these are strong indicators that issues are bugs. Results: Our approach performs best overall, but not significantly differently from an approach from the literature based on the fastText classifier from Facebook AI Research. The small improvements in prediction performance are due to the structural information about the issues that we used. We found that using information about the content of issues in the form of null pointer exceptions is not useful. We demonstrate the usefulness of issue type prediction through the example of labeling bug-fixing commits. Conclusions: Issue type prediction can be a useful tool if the use case tolerates either a certain number of missed bug reports or too many issues being predicted as bugs.

Are unit and integration test definitions still valid for modern Java projects? An empirical study on open-source projects

Trautsch

Herbold

Grabowski

2020

Journal of Systems and Software

The SmartSHARK ecosystem for software repository mining

Trautsch

Herbold

et al. 2020

Problems with SZZ and features: An empirical study of the state of practice of defect prediction data collection
