2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C)
DOI: 10.1109/icse-c.2017.76
Codeflaws: a programming competition benchmark for evaluating automated program repair tools

Cited by 52 publications (36 citation statements). References: 19 publications.
“…To account for these requirements, we used the Codeflaws benchmark (Tan et al 2017). This benchmark consists of 7,436 programs (among which 3,902 are faulty) selected from the Codeforces online database of programming contests.…”
Section: Benchmarks: Programs and Fault(s) (mentioning, confidence: 99%)
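As a rough illustration of how a benchmark with this composition might be consumed, the sketch below tallies total and faulty program versions from a hypothetical metadata file; the file name, column names, and layout are assumptions made for illustration, not the actual Codeflaws distribution format.

```python
import csv
from collections import Counter

def summarize_benchmark(metadata_path="codeflaws_metadata.csv"):
    """Count total vs. faulty program versions listed in a (hypothetical)
    metadata file with columns: program_id, is_faulty, defect_class."""
    totals = Counter()
    per_class = Counter()
    with open(metadata_path, newline="") as fh:
        for row in csv.DictReader(fh):
            totals["programs"] += 1
            if row["is_faulty"] == "1":
                totals["faulty"] += 1
                per_class[row["defect_class"]] += 1
    return totals, per_class

if __name__ == "__main__":
    totals, per_class = summarize_benchmark()
    # For Codeflaws this should report roughly 7,436 programs,
    # 3,902 of them faulty, spread over the benchmark's defect classes.
    print(totals)
    print(per_class.most_common(5))
```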
“…To evaluate our approach we used CodeFlaws [5]. The benchmark has 3,902 faulty program versions of 40 defect classes.…”
Section: Results (mentioning, confidence: 99%)
“…This way testers can focus on the most promising mutants and apply mutation on a best-effort basis. Experimental results using 10-fold cross validation on 1,629 faults, from the CodeFlaws benchmark [5], show a high performance of our approach. In particular our mutant selection method achieves significantly better results than random mutant selection by revealing 12% to 20% more faults.…”
Section: Introduction (mentioning, confidence: 99%)
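To make the evaluation style described in that excerpt concrete, the sketch below compares a learned mutant ranking against random selection under 10-fold cross validation. The feature matrix, labels, model choice, and inspection budget are all placeholders, not the cited authors' actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Placeholder data: one row per mutant, label = "mutant reveals a fault".
X = rng.normal(size=(1629, 20))
y = rng.integers(0, 2, size=1629)

def faults_revealed(selected_idx, labels):
    """Toy proxy: number of fault-revealing mutants among those selected."""
    return int(labels[selected_idx].sum())

kf = KFold(n_splits=10, shuffle=True, random_state=0)
budget = 50  # mutants a tester can afford to inspect per fold
learned, baseline = [], []

for train_idx, test_idx in kf.split(X):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    top = np.argsort(scores)[::-1][:budget]              # model-ranked selection
    rand = rng.choice(len(test_idx), budget, replace=False)  # random baseline
    learned.append(faults_revealed(top, y[test_idx]))
    baseline.append(faults_revealed(rand, y[test_idx]))

print("learned ranking:", np.mean(learned), "random selection:", np.mean(baseline))
```

With real mutant features and labels, the gap between the two averages corresponds to the kind of improvement over random selection that the excerpt reports.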
“…Ensuring that bugs can be reliably reproduced allows datasets to be used for a rich diversity of studies, including testing, fault localisation, and automated program repair, as similar datasets for non-robotic systems (Le Goues et al, 2015;Just et al, 2014;Tan et al, 2017;Sahoo et al, 2010;Do et al, 2005;Henningsson and Wohlin, 2004) have demonstrated in broader contexts. These studies inspire our work to recreate and detect robotics and autonomous systems bugs in simulation, with a view towards detecting new bugs, which is a direction the previous work does not take.…”
Section: Introduction (mentioning, confidence: 99%)
“…The DEFECTS4J (Just et al, 2014) and MANYBUGS (Le Goues et al, 2015) datasets consist of historical bugs in large-scale Java and C programs, respectively. At the opposite end of the scale, the CODEFLAWS (Tan et al, 2017) and INTROCLASS (Le Goues et al, 2015) datasets are composed of bugs in small, single-file programming assignments (or challenges) completed by novices, using C. The Software Infrastructure Repository (Do et al, 2005) represents the first concerted effort to provide a dataset of reproducible faults. Unlike the aforementioned datasets, the SIR is predominantly composed of artificial bugs, and covers programs written in a variety of different languages.…”
(mentioning, confidence: 99%)