Alleviating patch overfitting with automatic test generation: a study of feasibility and effectiveness for the Nopol repair system

Yu, Zhongxing; Martínez, Matías; Danglot, Benjamin; Durieux, Thomas; Monperrus, Martin

doi:10.1007/s10664-018-9619-4

Cited by 56 publications

(50 citation statements)

References 64 publications

“…To sum up, our contributions are: • A new version of QuixBugs that is usable for automatic repair research on Java programs, together with extensive data about the characteristics of QuixBugs. • The confirmation of 2 empirical facts of program repair, improving their external validity: 1) the state-of-the-art program repair tools produce overfitting patches, this confirms the results of [35], [31], [15]; 2) the state-of-the-art program repair tools also produce correct patches [29], [20]; 3) automatically generated tests can help to assess the correctness of patches in scientific studies, this confirms the results of [40], [36], [41]. • Three new and important findings about program repair: 1) the state-of-the-art program repair tools are able to repair programs with only failing test cases and no passing tests at all; 2) it is useful to design program specific test generators to discard incorrect patches; and 3) a small number of automatically generated test cases is enough to identify incorrect patches in scientific studies.…”

Section: Introduction (supporting)

confidence: 56%

“…In our experiment, we consider three techniques for patch correctness assessment: a) using automatically generated tests by a search-based approach based on a reference version [41]; b) using automatically generated tests by a program specific generator based on a reference version [2]; and c) manual analysis of patch correctness [20]. a) Search-based Test Generation Technique: Using automated test generation is one way for assessing patch correctness [37], [36], [41], [40]. In our study, the search-based test generator technique takes as input a reference version of buggy program.…”

Section: B Methodology (mentioning)

confidence: 99%

“…Eventually, we obtain n different independent JUnit test suites for each program. Since Evosuite is a randomized algorithm, we take n = 30 for the best practice [41]. We further remove those generated tests that fail on the reference version (due to limitation of Evosuite).…”

Section: B Methodology (mentioning)

confidence: 99%

“…On the contrary, a "correct" patch means that it is not overfitting to the input data and to the considered test cases. To evaluate the correctness of patches, we use three different patch correctness techniques based on: 1) a search-based test generation tool [41]; 2) a custom program specific test generation tool [2]; and 3) manual analysis [20]. In RQ4, we analyze in detail the effectiveness of automated patch correctness assessment techniques used in RQ3.…”

Section: A Research Questions (mentioning)

confidence: 99%

A Comprehensive Study of Automatic Program Repair on the QuixBugs Benchmark

Yang, Martínez, Durieux, et al. 2019

2019 IEEE 1st International Workshop on Intelligent Bug Fixing (IBF)

Self Cite

Automatic program repair papers tend to repeatedly use the same benchmarks. This poses a threat to the external validity of the findings of the program repair research community. In this paper, we perform an automatic repair experiment on a benchmark called QuixBugs that has never been studied in the context of program repair. In this study, we report on the characteristics of QuixBugs, and study five repair systems, Arja, Astor, Nopol, NPEfix and RSRepair, which are representatives of generate-and-validate repair techniques and synthesis repair techniques. We propose three patch correctness assessment techniques to comprehensively study overfitting and incorrect patches. Our key results are: 1) 15 / 40 buggy programs in the QuixBugs can be repaired with a test-suite adequate patch; 2) a total of 64 plausible patches for those 15 buggy programs in the QuixBugs are present in the search space of the considered tools; 3) the three patch assessment techniques discard in total 33 / 64 patches that are overfitting. This sets a baseline for future research of automatic repair on QuixBugs. Our experiment also highlights the major properties and challenges of how to perform automated correctness assessment of program repair patches. All experimental results are publicly available on Github in order to facilitate future research on automatic program repair.

“…This study focuses on test-suite adequate patches, which means that the generated patches make the test suite pass; yet, there is no guarantee that they fix the bugs. Studying patch correctness [19,44,49] is out of the scope of this work. Our goal is to analyze the current state of the automatic program repair tools and identify potential flaws and improvements.…”

Section: Discussion (mentioning)

confidence: 99%

Empirical review of Java program repair tools: a large-scale experiment on 2,141 bugs and 23,551 repair attempts

Durieux, Madeiral, Martínez, et al. 2019

Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of

Self Cite

In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, which are also rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 5 benchmarks of bugs. Our goal is to have a better understanding of the current state of automatic program repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not be generalized across different benchmarks of bugs. We found that the 11 tools 1) are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) have better performance on Defects4J compared to other benchmarks, by generating patches for 47% of the bugs from Defects4J compared to 10-30% of bugs from the other benchmarks. Our experiment comprises 23,551 repair attempts in total, which we used to find the causes of non-patch generation. These causes are reported in this paper, which can help repair tool designers to improve their techniques and tools.
