Is the cure worse than the disease? Overfitting in automated program repair

Smith, E. K. M.; Barr, Earl T.; Le Goues, Claire; Brun, Yuriy

doi:10.1145/2786805.2786825

Cited by 253 publications (185 citation statements, 175 of them mentioning)

References 67 publications

“…This gives a nuanced picture of the results, which must however be taken, as usual, with a grain of salt: different tools may focus on achieving a better ranking vs. correctly fixing more bugs, and we do not imply that there is one universal measure of effectiveness. Anyway, our evaluation is widely applicable (including to papers that may not detail this aspect) and is in line with what [was] done in other evaluations [14], [17], [25], [28], [29].…”

Section: E. Threats to Validity (supporting; confidence: 73%)

“…A more detailed analysis [28] of the fixes produced by GenProg and similar techniques has shown that only a small fraction of them is genuinely correct; for example, less than 2% of the bugs of [14] are correctly fixed. [28]'s analysis has pushed the research in APR to addressing this manifestation of the overfitting problem [29].…”

Section: Related Work (mentioning; confidence: 99%)

“…Since validation is against a finite, often small, number of tests, there is no guarantee that a valid repair is genuinely correct against a complete, and implicit, specification of the method. Indeed, experiments have repeatedly confirmed [19], [28], [29] that automated program repair techniques are prone to producing a significant fraction of valid but incorrect repairs, which merely happen to pass all available tests but are clearly inadequate from a programmer's perspective.…”

Section: Introduction (mentioning; confidence: 99%)


Contract-based program repair without the contracts

Chen, Pei, Furia (2017). 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE)


Abstract: Automated program repair (APR) is a promising approach to automatically fixing software bugs. Most APR techniques use tests to drive the repair process; this makes them readily applicable to realistic code bases, but also brings the risk of generating spurious repairs that overfit the available tests. Some techniques addressed the overfitting problem by targeting code using contracts (such as pre- and postconditions), which provide additional information helpful to characterize the states of correct and faulty computations; unfortunately, mainstream programming languages do not normally include contract annotations, which severely limits the applicability of such contract-based techniques. This paper presents JAID, a novel APR technique for Java programs, which is capable of constructing detailed state abstractions, similar to those employed by contract-based techniques, that are derived from regular Java code without any special annotations. Grounding the repair generation and validation processes on rich state abstractions mitigates the overfitting problem, and helps extend APR's applicability: in experiments with the Defects4J benchmark, a prototype implementation of JAID produced genuinely correct repairs, equivalent to those written by programmers, for 25 bugs, improving over the state of the art of comparable Java APR techniques in the number and kinds of correct fixes.
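The gap between a valid repair (one that passes all available tests) and a genuinely correct one, discussed in the statements and abstract above, can be made concrete with a deliberately exaggerated toy example. Everything below is hypothetical: the function names and the three-test suite are invented for illustration and are not drawn from GenProg, JAID, or any cited benchmark, and Python's built-in `max` stands in for the complete, implicit specification.

```python
# Hypothetical toy example of a "valid but incorrect" repair.
# Intended specification: return the largest element of a non-empty list.

def buggy_max(xs):
    best = 0  # bug: wrong initial value, so all-negative lists return 0
    for x in xs:
        if x > best:
            best = x
    return best

def overfit_patch_max(xs):
    # Passes the test suite by special-casing the failing input
    # instead of fixing the initialisation (deliberately exaggerated).
    if xs == [-3, -1, -2]:
        return -1
    return buggy_max(xs)

def correct_patch_max(xs):
    best = xs[0]  # fix: start from the first element
    for x in xs[1:]:
        if x > best:
            best = x
    return best

# The finite test suite used to validate candidate repairs: (input, expected).
TESTS = [([1, 5, 2], 5), ([7], 7), ([-3, -1, -2], -1)]

def is_valid(fn):
    # "Valid" in the APR sense: passes all available tests.
    return all(fn(inp) == expected for inp, expected in TESTS)

def matches_spec(fn, inputs):
    # Built-in max is a stand-in for the full, implicit specification.
    return all(fn(xs) == max(xs) for xs in inputs)

unseen_inputs = [[-5, -9], [-1], [0, -4, 3]]

print(is_valid(buggy_max))                             # False
print(is_valid(overfit_patch_max))                     # True
print(matches_spec(overfit_patch_max, unseen_inputs))  # False
print(is_valid(correct_patch_max), matches_spec(correct_patch_max, unseen_inputs))  # True True
```

Both patches are indistinguishable to test-based validation; only checking against inputs outside the suite (here, a stand-in oracle) exposes the overfitted one.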


“…The second thing is that we strongly advise researchers to evaluate the correctness rate of their automatic repairs as well as the fixing rate, whether by human inspection or by ground-truth comparison in a benchmark. Existing repair systems may fail to generate a true patch due to test suite overfitting [61], [62], a concept from statistics or machine learning. Here, overfitting means that the repair makes the program perform well on the test suite while failing in real usage.…”

Section: Impact of Test Suite Quality (mentioning; confidence: 99%)

A Survey of Test Based Automatic Program Repair

Liu, Long, Zhang (2018). JSW

Abstract: Testing and debugging have always been the most time-consuming parts of the software development process and require large amounts of human resources. When a bug is located, manually fixing it to repair the buggy program is still a difficult and laborious task for developers. Hence automatic program repair techniques, especially test-based approaches, have drawn great attention in recent years. Researchers have explored and proposed various novel methods and tools, pushing the idea closer to reality. In this paper, we systematically survey mainstream work in test-based program repair (TBR) and discuss the properties that automatically generated patches should have. We classify the state-of-the-art approaches for TBR and evaluate their strengths and weaknesses according to their functional mechanisms. Finally, we refer to some empirical results and propose four important issues, which are expected to be critical and constructive for this research area.
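The survey's advice to report a correctness rate alongside the fixing rate can be sketched as follows. The exact definitions are assumptions: here the fixing rate counts bugs that received a test-passing (plausible) patch, the correctness rate counts bugs whose patch is also judged correct, and precision is the share of plausible patches that are correct. The `RepairOutcome` record and the toy data are hypothetical and are not results from any study.

```python
from dataclasses import dataclass

@dataclass
class RepairOutcome:
    bug_id: str
    plausible: bool  # tool produced a patch that passes every available test
    correct: bool    # patch judged correct (human inspection or ground-truth diff)

def fixing_rate(outcomes):
    # Share of bugs for which a test-passing patch was produced.
    return sum(o.plausible for o in outcomes) / len(outcomes)

def correctness_rate(outcomes):
    # Share of bugs whose patch is both test-passing and judged correct.
    return sum(o.plausible and o.correct for o in outcomes) / len(outcomes)

def precision(outcomes):
    # Share of test-passing patches that are actually correct.
    plausible = [o for o in outcomes if o.plausible]
    if not plausible:
        return 0.0
    return sum(o.correct for o in plausible) / len(plausible)

# Made-up toy data: 10 bugs, 5 test-passing patches, 2 of them correct.
outcomes = [RepairOutcome(f"bug-{i}", plausible=i < 5, correct=i < 2) for i in range(10)]
print(fixing_rate(outcomes), correctness_rate(outcomes), precision(outcomes))  # 0.5 0.2 0.4
```

Reporting only the fixing rate (0.5 on this toy data) would hide that most of the test-passing patches overfit the suite.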


“…GenProg itself is derived from earlier work by Weimer et al. [11,37,39]. Smith et al. is an example of a more recent use of this framework [34]. Although GenProg is perhaps the most commonly known framework, there is also a large body of GI literature dedicated to automatic bug fixing [1,2,4,28,31] that uses alternative approaches.…”

Section: Related Work (mentioning; confidence: 99%)

Exploring Fitness and Edit Distance of Mutated Python Programs

Haraldsson, Woodward, Brownlee, et al. (2017). Lecture Notes in Computer Science

Abstract: Genetic Improvement (GI) is the process of using computational search techniques to improve existing software, e.g., in terms of execution time, power consumption or correctness. As in most heuristic search algorithms, the search is guided by fitness, with GI searching the space of program variants of the original software. The relationship between the program space and fitness is seldom simple and often quite difficult to analyse. This paper makes a preliminary analysis of GI's fitness distance measure on program repair with three small Python programs. Each program undergoes incremental mutations while the change in fitness, as measured by the proportion of tests passed, is monitored. We conclude that the fitness of these programs often does not change with single mutations, and we also confirm the inherent discreteness of bug-fixing fitness functions. Although our findings cannot be assumed to be general for other software, they provide us with interesting directions for further investigation.
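A toy analogue of the setting in this abstract: fitness defined as the proportion of tests passed, evaluated on every single mutation of a one-line program. The program representation (a comparison operator plus a constant), the five-test suite, and the two mutation operators are assumptions made for illustration; the paper studies real Python programs with its own mutation scheme, which this sketch does not reproduce.

```python
import operator
from fractions import Fraction

# Hypothetical specification: "return True iff x >= 5", checked by 5 tests.
TESTS = [(0, False), (3, False), (5, True), (8, True), (10, True)]

def fitness(op, const):
    # Fitness = proportion of tests passed; it can only take the values k/5.
    passed = sum(op(x, const) == expected for x, expected in TESTS)
    return Fraction(passed, len(TESTS))

def single_mutations(op, const):
    # Neighbours of a program: swap the comparison operator, or nudge the constant.
    for new_op in (operator.ge, operator.gt, operator.le, operator.lt, operator.eq):
        if new_op is not op:
            yield new_op, const
    for delta in (-1, 1):
        yield op, const + delta

buggy = (operator.gt, 5)  # off-by-one bug: ">" where ">=" was intended
print("original:", fitness(*buggy))
for op, const in single_mutations(*buggy):
    print(f"{op.__name__} {const}: {fitness(op, const)}")
```

On this toy, fitness is restricted to multiples of 1/5, and the mutant that changes the constant from 5 to 6 keeps fitness at 4/5, the same as the original: a small instance of the discrete, often flat landscape the abstract describes.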
