Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021)
DOI: 10.18653/v1/2021.emnlp-main.685

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Abstract: Pre-trained models for Natural Languages (NL) like BERT and GPT have recently been shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks, or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained…
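For orientation, here is a minimal sketch of querying the released model for masked-span infilling on code, assuming the Salesforce/codet5-base checkpoint on the Hugging Face Hub and the transformers library; the masked snippet is illustrative:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Load the publicly released base checkpoint (assumed available on the Hub).
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# A code snippet with one span hidden behind a T5-style sentinel token.
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# The decoder predicts the content of the masked span.
generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))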

Cited by 468 publications (371 citation statements); references 55 publications.

Citation statements (ordered by relevance):
“…• Code translation, in which a generative model translates source code from one language (e.g., Java) to another (e.g., Python). This task has been an important benchmark for technical work in GenAI for code [61] and has gained extensive attention from both industry and academia [3,23,30,88,99]. Such technologies can significantly reduce the cost and expertise barriers for code modernization work, in which a legacy codebase is ported to a modern programming language.…”
Section: Use Cases and Scenarios (mentioning)
confidence: 99%
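To illustrate the translation setting described in this citation, a fine-tuned encoder-decoder checkpoint can be driven like any seq2seq model; the checkpoint name below is hypothetical and stands in for whichever Java-to-Python translation fine-tune is actually used:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical checkpoint name; substitute a real translation fine-tune.
CKPT = "your-org/codet5-java-to-python"

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)

java_src = "public int add(int a, int b) { return a + b; }"
inputs = tokenizer(java_src, return_tensors="pt")

# Beam search usually gives more faithful translations than greedy decoding.
out = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))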
“…In this work, we present GAP-Gen, a method to improve automatic Python source code generation from natural language descriptions. GAP-Gen fine-tunes the pre-trained T5-English (Raffel et al., 2020a) and CodeT5 (Wang et al., 2021) language models, using Syntax-Flow and Variable-Flow as guidance, and has been shown to capture the relationship between the natural language description and the Python code at both the syntactic and semantic levels.…”
Section: Code (mentioning)
confidence: 99%
“…More related to our work, Husain et al. (2019) and Clement et al. (2020) explore pre-training methodologies for learning better structural and syntactic information for automatic code generation. Moreover, Wang et al. (2021) and Guo et al. (2021) incorporate Variable-Flows and identifier information into their pre-training process for better code generation performance.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, existing code pre-training approaches directly adopt (masked) language modeling as the training objective, which aims to predict (masked) tokens in a given code context (Feng et al., 2020; Guo et al., 2021; Wang et al., 2021b). This token-based approach generally results in poor code semantic representations, for two reasons.…”
Section: Introduction (mentioning)
confidence: 99%
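To make the objective this citation refers to concrete, here is a minimal sketch of T5-style masked span denoising on code, assuming the Salesforce/codet5-base checkpoint and the transformers library; the corrupted snippet and its target are illustrative:

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Corrupted input: masked spans are replaced by sentinel tokens.
source = "def <extra_id_0>(a, b):\n    return a <extra_id_1> b"
# Target: each sentinel token followed by the tokens it replaced.
target = "<extra_id_0> add <extra_id_1> +"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# The model is trained to minimize cross-entropy over the target sequence.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
print(f"denoising loss: {loss.item():.3f}")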