2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) 2021
DOI: 10.1109/icse43902.2021.00041
Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks

Abstract: Deep learning (DL) techniques are gaining more and more attention in the software engineering community. They have been used to support several code-related tasks, such as automatic bug fixing and code comments generation. Recent studies in the Natural Language Processing (NLP) field have shown that the Text-To-Text Transfer Transformer (T5) architecture can achieve state-of-the-art performance for a variety of NLP tasks. The basic idea behind T5 is to first pre-train a model on a large and generic dataset usi…
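
The paradigm described in the abstract (generic pre-training followed by task-specific fine-tuning) can be illustrated with a minimal, hypothetical sketch using the Hugging Face transformers library; the t5-small checkpoint, the "summarize code:" prefix, and the toy code/comment pair below are illustrative assumptions, not the paper's actual setup.

# Minimal sketch of T5's pre-train/fine-tune idea applied to a code task.
# Assumptions (not from the paper): the t5-small checkpoint, the task prefix,
# and the single toy training pair.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Start from a generically pre-trained T5 checkpoint.
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One toy code/comment pair standing in for a fine-tuning dataset.
source = "summarize code: public int add(int a, int b) { return a + b; }"
target = "Returns the sum of two integers."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# One supervised fine-tuning step: T5 frames every task as text-to-text,
# so the loss is ordinary sequence-to-sequence cross-entropy.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()

# After fine-tuning, the same interface generates output for any code-related task.
generated = model.generate(**inputs, max_length=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

In practice the model would be trained over many such pairs with an optimizer; this sketch only shows the shape of a single text-to-text step.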

Cited by 160 publications (73 citation statements)
References 48 publications
“…CuBERT employs BERT's powerful masked language modeling objective to derive a generic code-specific representation, and CodeBERT further adds a replaced token detection task (Clark et al., 2020) […] encoder-decoder models based on T5 for programming language pre-training and support a more comprehensive set of tasks. Some emerging work (Clement et al., 2020; Mastropaolo et al., 2021; Elnaggar et al., 2021) in the recent literature also explores the T5 framework on code, but they focus only on a limited subset of generation tasks and do not support understanding tasks like ours. Apart from these, PLBART (Ahmad et al., 2021), based on another encoder-decoder model, BART, can also support both understanding and generation tasks.…”
Section: Related Work
confidence: 99%
“…For instance, there have been numerous investigations into techniques for code summarization, from early approaches using text retrieval [11] through to more recent ML-based approaches that are often framed as a Neural Machine Translation (NMT) problem (e.g., see recent works by Mastropaolo et al [17] and Haque et al [12]). Generally, deep learning approaches for code summarization try to infer information that is not explicitly in the code, being trained on code/comment pairs (such as LeClair and McMillan's dataset [14]) and evaluated on regular (i.e., not decompiled) source code.…”
Section: Reading List: Related Work
confidence: 99%
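
The code/comment pairs mentioned in the excerpt above are the training signal for such NMT-style summarizers. Below is a hypothetical sketch of how pairs like these could be mined from Python sources with the standard ast module; it is only an illustration of the idea, not how LeClair and McMillan's dataset [14] was actually built.

# Hypothetical sketch: mine (code, comment) pairs from documented functions.
# Real code-summarization datasets apply far more careful filtering; this only
# illustrates the structure of the training data.
import ast
import textwrap

def extract_pairs(source: str):
    """Yield (code, summary) pairs for every documented function in the source."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                summary = doc.strip().splitlines()[0]  # first docstring line as target comment
                code = ast.unparse(node)               # function source as model input
                # (a real pipeline would usually strip the docstring from the input code)
                yield code, summary

example = textwrap.dedent('''
    def add(a, b):
        """Return the sum of two numbers."""
        return a + b
''')

for code, comment in extract_pairs(example):
    print("CODE:", code)
    print("COMMENT:", comment)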
“…Other pretrained transformers used on source code include CodeT5 (Wang et al., 2021b), CodeTrans (Elnaggar et al., 2021), PyMT5 (Clement et al., 2020), CuBERT (Kanade et al., 2020), PLBART, ProphetNet-X (Qi et al., 2021), CoTexT (Phan et al., 2021), T5-Code (Mastropaolo et al., 2021), GraphCodeBERT, and AlphaCode (Li et al., 2022). Pretrained GPT-style models for source code generation include CodeGPT and GPT-Codex (Chen et al., 2021a).…”
Section: Pretrained Transformer Models
confidence: 99%