Using Transfer Learning for Code-Related Tasks
2023
DOI: 10.1109/tse.2022.3183297

Cited by 24 publications (11 citation statements). References 59 publications.
“…As representative of transformers [1], we adopt the T5 proposed by Raffel et al [20], that has been already used in SE to automate code-related tasks [9], [13], [14], [58], [59]. Masks X% of tokens (usually 15%) in the instance (e.g., a function) and asks the model to guess the masked tokens based on their bidirectional context.…”
Section: A Transformer Model (citation type: mentioning; confidence: 99%)
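The masked-token pre-training objective described in this statement can be sketched in a few lines of Python. The helper below is a minimal illustration (hypothetical names, not code from the cited papers): it hides roughly 15% of the tokens of a function and keeps the originals as the targets the model must recover from bidirectional context.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="<MASK>", seed=0):
    """Corrupt ~mask_rate of the tokens in one instance (e.g., a function)
    and return (corrupted_input, targets) for a masked-prediction objective."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))

    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]  # ground-truth token to predict
        corrupted[pos] = mask_token    # hide it from the model

    return corrupted, targets

# Toy example: a small Java-like function split into tokens.
tokens = "public int add ( int a , int b ) { return a + b ; }".split()
corrupted, targets = mask_tokens(tokens)
print(" ".join(corrupted))
print(targets)
```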
“…Then, a labeled dataset mapping English sentences to their corresponding German translation can be used to fine-tune the model. Several works applying DL to SE report a boost of performance provided by pre-training in the automation of code-related tasks [11], [13], [14]. However, little is known about (i) the circumstances in which pre-training actually helps, and (ii) the impact of the specific pre-training objective(s) adopted on the performance of transformers when automating code-related tasks.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
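To make the pre-train-then-fine-tune recipe concrete, the sketch below runs a single supervised fine-tuning step on one English-to-German pair with the Hugging Face transformers library. The checkpoint name, learning rate, and example sentence are illustrative assumptions, not details taken from the cited works.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Start from an already pre-trained (unsupervised) checkpoint.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One labeled example from the downstream task (English -> German).
src = "translate English to German: The house is small."
tgt = "Das Haus ist klein."

inputs = tokenizer(src, return_tensors="pt")
labels = tokenizer(tgt, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
loss = model(**inputs, labels=labels).loss  # supervised fine-tuning loss
loss.backward()
optimizer.step()
print(float(loss))
```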
“…Mastropaolo et al. [104] propose a pre-trained text-to-text transfer transformer (T5) to address four code-related tasks, namely automatic bug fixing, injection of code mutants, generation of assert statements in test methods, and code summarization. They apply the BFP_small and BFP_medium datasets to train and evaluate the bug-fixing task, and then compare other state-of-the-art learning-based APR tools on the same benchmark.…”
Section: Universal (citation type: mentioning; confidence: 99%)
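The text-to-text framing that lets a single T5 handle all four tasks can be illustrated with task prefixes prepended to the input. The prefixes and helper below are hypothetical and only show the general idea; the exact input encoding used by Mastropaolo et al. may differ.

```python
# Hypothetical task prefixes for a multi-task text-to-text setup.
PREFIXES = {
    "bug_fixing": "fix bug: ",
    "mutant_injection": "inject mutant: ",
    "assert_generation": "generate assert: ",
    "code_summarization": "summarize: ",
}

def to_text_to_text(task: str, source: str) -> str:
    """Cast a code-related task instance into a single text-to-text format."""
    return PREFIXES[task] + source

print(to_text_to_text("bug_fixing", "public int add(int a, int b) { return a - b; }"))
```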
“…We run our specification-inference tool on them and, after a filtering procedure where duplicates and invalid Dockerfiles are removed, we end up with a set of 670,982 unique pairs ⟨HLS, Dockerfile⟩. We use this dataset to train and test a state-of-the-art DL model, the Text-to-Text Transfer Transformer (T5) [17], which has been proven effective when supporting several coding tasks [13], [14], following the same pipeline defined in the literature. We compare the DL-based approach with two Information Retrieval (IR)-based approaches (i.e., less complex and less resource-requiring alternatives), and we check to what extent, given a HLS, the output Dockerfiles of the three techniques: (i) meet the input requirements, (ii) are similar to the target Dockerfile, and (iii) allow to build a Docker image similar to the target one.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
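An IR-based alternative of the kind compared against here can be as simple as TF-IDF retrieval: given a new HLS, return the Dockerfile paired with the most similar training HLS. The sketch below is an assumption-laden illustration (toy corpus, hypothetical function name), not the authors' actual baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy training corpus of (HLS, Dockerfile) pairs; illustrative only.
train_hls = [
    "python 3.10 image installing flask and gunicorn",
    "node 18 image running an express server",
]
train_dockerfiles = [
    "FROM python:3.10\nRUN pip install flask gunicorn",
    'FROM node:18\nCOPY . .\nRUN npm install\nCMD ["node", "server.js"]',
]

vectorizer = TfidfVectorizer().fit(train_hls)
train_matrix = vectorizer.transform(train_hls)

def retrieve_dockerfile(query_hls: str) -> str:
    """Return the Dockerfile whose training HLS is closest under TF-IDF cosine similarity."""
    sims = cosine_similarity(vectorizer.transform([query_hls]), train_matrix)[0]
    return train_dockerfiles[sims.argmax()]

print(retrieve_dockerfile("python image with flask"))
```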
“…The stop criterion we adopt, which is the one currently used for coding tasks [14], is based on the convergence in terms of BLEU-4 score. However, considering our results, it seems to be ineffective in the evaluated context.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
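The BLEU-4-based stop criterion mentioned here can be approximated with a simple patience check on the validation score: stop once the score has not improved meaningfully for a number of consecutive evaluations. The patience and delta values below are illustrative assumptions, not the exact settings used in [14].

```python
def should_stop(bleu4_history, patience=5, min_delta=0.01):
    """bleu4_history: validation BLEU-4 after each evaluation step.
    Returns True when the last `patience` evaluations did not improve on the
    best earlier score by at least `min_delta` (i.e., training has converged)."""
    if len(bleu4_history) <= patience:
        return False
    best_before = max(bleu4_history[:-patience])
    recent_best = max(bleu4_history[-patience:])
    return recent_best - best_before < min_delta

print(should_stop([0.10, 0.18, 0.22, 0.23, 0.231, 0.232, 0.232, 0.231, 0.232]))  # True
```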