An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation

Tufano, Michele; Watson, Corey T.; Bavota, Gabriele; Penta, Massimiliano Di; White, Martin; Poshyvanyk, Denys

doi:10.1145/3340544

Cited by 220 publications

(308 citation statements)

References 97 publications

Supporting

Mentioning

301

Contrasting


“…During patch inference, we still generate abstract buggy context for the bug, as described in Section 3.2. But we will use beam search to generate multiple likely patches for the same buggy line, as done in related work [10], [24]. Beam search works by keeping the n best sequences up to the current decoder state.…”

Section: Patch Inference (mentioning)

confidence: 99%
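
The beam search described in the statement above can be sketched as follows. This is a minimal illustration, not SEQUENCER's implementation: `beam_search`, `toy_model`, and the token probabilities are hypothetical names and values, with `toy_model` standing in for a trained decoder.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(
    next_token_probs: Callable[[Sequence], Dict[str, float]],
    beam_size: int,
    max_len: int,
    eos: str = "<eos>",
) -> List[Tuple[Sequence, float]]:
    """Return up to beam_size (sequence, log-probability) pairs, best first."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    for _ in range(max_len):
        candidates: List[Tuple[Sequence, float]] = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                # Finished hypotheses are carried over unchanged.
                candidates.append((seq, score))
                continue
            for token, prob in next_token_probs(seq).items():
                if prob > 0.0:
                    candidates.append((seq + (token,), score + math.log(prob)))
        # Keep only the n best sequences up to the current decoding step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams


def toy_model(prefix: Sequence) -> Dict[str, float]:
    """Hypothetical next-token distribution standing in for a trained decoder."""
    if not prefix:
        return {"return": 0.6, "throw": 0.4}
    if prefix[-1] == "return":
        return {"x": 0.7, "y": 0.3}
    if prefix[-1] == "throw":
        return {"e": 0.9, "x": 0.1}
    return {"<eos>": 1.0}


# Generate the two most likely candidate "patches" under the toy model.
patches = beam_search(toy_model, beam_size=2, max_len=4)
for tokens, log_prob in patches:
    print(" ".join(tokens), round(math.exp(log_prob), 2))
```

With a beam size of 2, both surviving hypotheses are returned as candidate patches, which is what lets a repair tool propose several fixes for the same buggy line.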

“…Vocabulary In this paper, we consider a vocabulary of the 1,000 most common tokens. To the best of our knowledge, this is one of the largest vocabularies considered for machine learning for patch generation: for comparison, DeepFix [27] has a vocabulary size of 129 words, and Tufano et al [10] considered a vocabulary size of 430 words.…”

Section: Implementation Details and Parameter Settings (mentioning)

confidence: 99%

“…• Token embedding (our model uses the same embedding for both g e and g d ): 1,004x256 (1,000 + 4 special tokens) We use a beam size of 50 during inference, which is the default value used in the literature [10] [24] and which proves to be good empirically.…”

Section: Implementation Details and Parameter Settings (mentioning)

confidence: 99%

“…Second, we report on using the copy mechanism on seq-to-seq learning on source code. Third, on the same buggy input dataset, SEQUENCER is able to produce the correct patch for 119% more samples than the closest related work [10].…”

Section: Introduction (mentioning)

confidence: 96%

“…Our golden trained model is able to perfectly fix 950/4,711 testing samples. To the best-of-our knowledge, this is the best result reported on such a task at the time of writing this paper [10][11] [12]. • We evaluate our approach on the 75 one-line bugs of Defects4J, which is the most widely used benchmark for evaluating programming repair contributions.…”

Section: Introduction (mentioning)

confidence: 99%


SEQUENCER: Sequence-to-Sequence Learning for End-to-End Program Repair

Chen

Kommrusch

Tufano

et al. 2021

IEEE Trans. Software Eng.

Self Cite

190

339


This paper presents a novel end-to-end approach to program repair based on sequence-to-sequence learning. We devise, implement, and evaluate a technique, called SEQUENCER, for fixing bugs based on sequence-to-sequence learning on source code. This approach uses the copy mechanism to overcome the unlimited vocabulary problem that occurs with big code. Our system is data-driven; we train it on 35,578 samples, carefully curated from commits to open-source repositories. We evaluate SEQUENCER on 4,711 independent real bug fixes, as well as on the Defects4J benchmark used in program repair research. SEQUENCER is able to perfectly predict the fixed line for 950/4,711 testing samples, and find correct patches for 14 bugs in the Defects4J benchmark. SEQUENCER captures a wide range of repair operators without any domain-specific top-down design.

Index Terms: program repair; machine learning.

Zimin Chen is currently a PhD student at KTH Royal Institute of Technology. He also received the BS and MS degrees in computer science from KTH. His research interest lies in the intersection between machine learning and software engineering, especially between automatic program repair and machine learning.

Steve Kommrusch is currently a PhD candidate focused on machine learning at Colorado State University. He received his BS in computer engineering from University of Illinois in 1987 and his MS in EECS from MIT in 1989. From 1989 through 2017, he worked in industry at Hewlett-Packard, National Semiconductor, and Advanced Micro Devices. Steve holds over 30 patents in the fields of computer graphics algorithms, silicon simulation and debug techniques, and silicon performance and power management. His research interests include Program Equivalence, Program Repair, and Constructivist AI using machine learning.

Electrical and Computer Engineering department. He is working on pattern-specific languages and compilers for scientific computing, and has designed numerous approaches using optimizing compilation to effectively map applications to CPUs, GPUs, FPGAs and System-on-Chips. His work spans a variety of domains including compiler optimization design especially in the polyhedral compilation framework, high-level synthesis for FPGAs and SoCs, and distributed computing. Previously


RobustNPR: Evaluating the robustness of neural program repair models

Liu

et al. 2023

J Software Evolu Process


Due to the high cost of repairing defective programs, much research focuses on automatic program repair (APR). In recent years, a new trend in APR is to apply neural networks to automatically mine the relations between defective programs and their corresponding patches, which is known as neural program repair (NPR). The community, however, ignores some important properties that could impact the applicability of NPR systems, such as robustness. For semantically identical buggy programs, NPR systems may produce totally different patches. In this paper, we propose an evaluation tool named RobustNPR, the first NPR robustness evaluation tool. RobustNPR employs several mutators to generate semantically identical mutants of defective programs. For an original defective program and its mutant, it checks two aspects of NPR: (a) can NPR fix mutants when it can fix the original defective program? and (b) can NPR generate semantically identical patches for the original program and the mutant? Then, we evaluate four SOTA NPR models and analyze the results. From the results, we find that even for the best‐performing model, 20.16% of the repair success is unreliable, which indicates that the robustness of NPR is not perfect. In addition, we find that the robustness of NPR is correlated with model settings and other factors.


Fine-Tuning GPT-2 to Patch Programs, Is It Worth It?

Lajkó

Horváth

Csuvik

et al. 2022

Computational Science and Its Applications – ICCSA 2022 Workshops


The application of Artificial Intelligence (AI) in the Software Engineering (SE) field is always a bit delayed compared to state-of-the-art research results. While the Generative Pre-trained Transformer (GPT-2) model was published in 2018, only a few recent works have applied it to SE tasks. One such task is Automated Program Repair (APR), where the applied technique should find a fix for software bugs without human intervention. One problem emerges here: the creation of proper training data is resource intensive and requires several hours of additional work from researchers. The sole reason for this is that training a model to repair programs automatically requires both the buggy program and the fixed one at large scale and presumably in an already pre-processed form. There are currently few such databases, so teaching and fine-tuning models is not an easy task. In this work we wanted to investigate how the GPT-2 model performs when it is not fine-tuned for the APR task, compared to when it is fine-tuned. From previous work we already know that the GPT-2 model can automatically generate patches for buggy programs, although the literature lacks studies where no fine-tuning has taken place. For the sake of experiment we evaluated the GPT-2 model out-of-the-box and also fine-tuned it before the evaluation on 1559 JavaScript code snippets. Based on our results we can conclude that although the fine-tuned model was able to learn how to write syntactically correct source code on almost every attempt, the non-fine-tuned model lacked some of these positive features.
