BEARS: An Extensible Java Bug Benchmark for Automatic Program Repair Studies

Madeiral, Fernanda; Urli, Simon; Maia, Marcelo de Almeida; Monperrus, Martin

doi:10.1109/saner.2019.8667991

Cited by 104 publications

(74 citation statements)

References 17 publications


“…There are some benchmarks that were rarely used or never used so far: this is partially explained by the fact that some benchmarks were recently published (e.g. Bears [25]), thus they were not available when some repair tools were published.…”

Section: State Of Affairs On Test-suite-based Automatic Repair Tools mentioning

confidence: 99%


Empirical review of Java program repair tools: a large-scale experiment on 2,141 bugs and 23,551 repair attempts

Durieux, Madeiral, Martínez, et al. 2019

Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of

Self Cite


In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, which are also rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 5 benchmarks of bugs. Our goal is to have a better understanding of the current state of automatic program repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not be generalized across different benchmarks of bugs. We found that the 11 tools 1) are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) have better performance on Defects4J compared to other benchmarks, by generating patches for 47% of the bugs from Defects4J compared to 10-30% of bugs from the other benchmarks. Our experiment comprises 23,551 repair attempts in total, which we used to find the causes of non-patch generation. These causes are reported in this paper, which can help repair tool designers to improve their techniques and tools.


“…Bears [25] contains 251 bugs from 72 different GitHub projects with an average size of 62,597 lines of Java code. It was created by mining software repositories based on commit building state from Travis Continuous Integration.…”

Section: Subject Benchmarks Of Bugs mentioning

confidence: 99%

Empirical review of Java program repair tools: a large-scale experiment on 2,141 bugs and 23,551 repair attempts

Durieux, Madeiral, Martínez, et al. 2019

Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of

Self Cite


“…The performance achieved by TBar on Defects4J may not be reached on a bigger, more diverse and more representative dataset. To address this threat, new benchmarks such as Bugs.jar [61] and Bears [46] should be investigated.…”

Section: Threats To Validity mentioning

confidence: 99%

TBar: revisiting template-based automated program repair

Liu, Koyuncu, Kim, et al. 2019

Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis


We revisit the performance of template-based APR to build comprehensive knowledge about the effectiveness of fix patterns, and to highlight the importance of complementary steps such as fault localization or donor code retrieval. To that end, we first investigate the literature to collect, summarize and label recurrently-used fix patterns. Based on the investigation, we build TBar, a straightforward APR tool that systematically attempts to apply these fix patterns to program bugs. We thoroughly evaluate TBar on the Defects4J benchmark. In particular, we assess the actual qualitative and quantitative diversity of fix patterns, as well as their effectiveness in yielding plausible or correct patches. Eventually, we find that, assuming a perfect fault localization, TBar correctly/plausibly fixes 74/101 bugs. Replicating a standard and practical pipeline of APR assessment, we demonstrate that TBar correctly fixes 43 bugs from Defects4J, an unprecedented performance in the literature (including all approaches, i.e., template-based, stochastic mutation-based or synthesis-based APR).

CCS Concepts: Software and its engineering → Software verification and validation; Software defect analysis; Software testing and debugging.

Keywords: Automated program repair, fix pattern, empirical assessment.


“…On the contrary, the developer test is shorter and directly targets the changed behavior, which is better. JSOUP#3676B13: This change is a pull request (i.e., a set of commits) and introduces 5 new behavioral changes.…”

Section: Rq4: How Do Human and Generated Tests That Detect Behavioral mentioning

confidence: 99%

An approach and benchmark to detect behavioral changes of commits in continuous integration

et al. 2020

Self Cite


When a developer pushes a change to an application's codebase, a good practice is to have a test case specifying this behavioral change. Thanks to continuous integration (CI), the test is run on subsequent commits to check that they do not introduce a regression for that behavior. In this paper, we propose an approach that detects behavioral changes in commits. As input, it takes a program, its test suite, and a commit. Its output is a set of test methods that capture the behavioral difference between the pre-commit and post-commit versions of the program. We call our approach DCI (Detecting behavioral changes in CI). It works by generating variations of the existing test cases through (i) assertion amplification and (ii) a search-based exploration of the input space. We evaluate our approach on a curated set of 60 commits from 6 open source Java projects. To our knowledge, this is the first ever curated dataset of real-world behavioral changes. Our evaluation shows that DCI is able to generate test methods that detect behavioral changes. Our approach is fully automated and can be integrated into current development processes. The main limitations are that it targets unit tests…

B. Danglot, Inria Lille - Nord Europe
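The idea in the DCI abstract above, a test that captures the behavioral difference between two versions of a program, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from DCI: program versions are modeled as plain functions, tests as callables that return a pass/fail verdict, and `differing_tests`, `pre_commit`, and `post_commit` are hypothetical names. The sketch only covers the final comparison step; it does not generate tests through assertion amplification or search.

```python
from typing import Callable, Dict, List

# Hypothetical modeling choices, not taken from DCI itself:
# a program version is a plain function, and a test is a callable that
# receives a program version and returns True when it passes.
Program = Callable[[int], int]
Test = Callable[[Program], bool]


def differing_tests(pre: Program, post: Program, tests: Dict[str, Test]) -> List[str]:
    """Return the names of tests whose verdict differs between the two versions."""
    return [name for name, test in tests.items() if test(pre) != test(post)]


# Toy pre-commit and post-commit versions of the same function.
def pre_commit(x: int) -> int:
    return abs(x)


def post_commit(x: int) -> int:
    return x  # the commit changes the result for negative inputs


tests: Dict[str, Test] = {
    "test_positive": lambda program: program(3) == 3,
    "test_negative": lambda program: program(-3) == 3,
}

changed = differing_tests(pre_commit, post_commit, tests)
print(changed)  # ['test_negative']
```

In this toy setting, only the test that exercises a negative input gets a different verdict across the two versions, so it is the one that captures the change.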

