CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Wang, Yue; Wang, Weishi; Joty, Shafiq; Hoi, Steven C. H.

doi:10.18653/v1/2021.emnlp-main.685

Cited by 468 publications

(371 citation statements)

References 55 publications

Citation statements by type: Supporting, Mentioning (264), Contrasting, Unclassified


“…• Code translation, in which a generative model translates source code from one language (e.g., Java) to another (e.g., Python). This task has been an important benchmark for technical work in GenAI for code [61] and has gained extensive attention from both industry and academia [3,23,30,88,99]. Such technologies can significantly reduce the cost and expertise barriers for code modernization work, in which a legacy codebase is ported to a modern programming language.…”

Section: Use Cases and Scenarios (mentioning)

confidence: 99%

Investigating Explainability of Generative AI for Code through Scenario-based Design

Sun, Liao, Müller et al. 2022

27th International Conference on Intelligent User Interfaces


What does it mean for a generative AI model to be explainable? The emergent discipline of explainable AI (XAI) has made great strides in helping people understand discriminative models. Less attention has been paid to generative models that produce artifacts, rather than decisions, as output. Meanwhile, generative AI (GenAI) technologies are maturing and being applied to application domains such as software engineering. Using scenario-based design and question-driven XAI design approaches, we explore users' explainability needs for GenAI in three software engineering use cases: natural language to code, code translation, and code autocompletion. We conducted 9 workshops with 43 software engineers in which real examples from state-of-the-art generative AI models were used to elicit users' explainability needs. Drawing from prior work, we also propose 4 types of XAI features for GenAI for code and gather additional design ideas from participants. Our work explores explainability needs for GenAI for code and demonstrates how human-centered approaches can drive the technical development of XAI in novel domains.

CCS Concepts: Computing methodologies → Natural language generation; Software and its engineering; Human-centered computing → Human computer interaction (HCI); User studies.


“…In this work, we present GAP-Gen, a method to improve automatic Python source code generation from natural language description. Our GAP-Gen is fine-tuning of the pre-trained T5-English (Raffel et al, 2020a) and CodeT5 (Wang et al, 2021) language models that employ Syntax-Flow and Variable-Flow as guidance and has shown on being able to understand the relationship between natural language description and Python code from syntactic and semantic level of the Python code.…”

Section: Code (mentioning)

confidence: 99%

“…More related to our work, (Husain et al, 2019; Clement et al, 2020) explore pre-training methodologies for learning better structural and syntactical information for automatic code generation. Moreover, (Wang et al, 2021; Guo et al, 2021) incorporates Variable-Flows and identifier information into their pre-training process for better code generation performance.…”

Section: Related Work (mentioning)

confidence: 99%

“…It can be effectively applied for maintaining the naming semantics of the code during the code generation process. (Wang et al, 2021; Guo et al, 2021) use Variable-Flow during their pre-training process and achieve good performances on programming language relevant tasks. In their works, they extract function variables names as Variable-Flow which is integrated into their pre-training process for improving language models' capability on understanding the code semantic structure.…”

Section: Variable-flow (mentioning)

confidence: 99%


GAP-Gen: Guided Automatic Python Code Generation

Zhao, Song, Wang et al. 2022

Preprint


Automatic code generation from natural language descriptions can be highly beneficial during the process of software development. In this work, we propose GAP-Gen, an automatic code generation method guided by Python syntactic constraints and semantic constraints. We first introduce Python syntactic constraints in the form of Syntax-Flow, which is a simplified version of the Abstract Syntax Tree (AST) that reduces its size and complexity while maintaining the crucial syntactic information of Python code. In addition to Syntax-Flow, we introduce Variable-Flow, which abstracts variable and function names consistently throughout the code. In our work, rather than pre-training, we focus on modifying the fine-tuning process, which reduces computational requirements but retains high generation performance on the automatic Python code generation task. GAP-Gen fine-tunes the transformer-based language models T5 and CodeT5 using the Code-to-Docstring datasets CodeSearchNet, CodeSearchNet AdvTest and Code-Docstring-Corpus from EdinburghNLP. Our experiments show that GAP-Gen achieves better results on the automatic Python code generation task than previous works.
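As an illustration of what "abstracting variable and function names consistently" could look like, the following is a minimal sketch built on Python's standard `ast` module (Python 3.9+ for `ast.unparse`). It is not GAP-Gen's actual Variable-Flow procedure, which this abstract does not specify; the `func_N`/`var_N` placeholder scheme and the names `NameAbstractor` and `abstract_names` are assumptions made for the example, and classes, lambdas, and async functions are left unhandled.

```python
import ast


class NameAbstractor(ast.NodeTransformer):
    """Hypothetical helper: replace user-defined names with stable placeholders."""

    def __init__(self):
        self.mapping = {}  # original name -> placeholder

    def _alias(self, name, prefix):
        # Reuse the same placeholder every time a name reappears.
        if name not in self.mapping:
            self.mapping[name] = f"{prefix}{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name, "func_")
        self.generic_visit(node)  # visits arguments first, then the body
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg, "var_")
        return node

    def visit_Name(self, node):
        # Rename only names bound in this snippet; builtins such as
        # `len` are never stored to, so they are left untouched.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._alias(node.id, "var_")
        return node


def abstract_names(source):
    """Return (abstracted source, name mapping) for a Python snippet."""
    abstractor = NameAbstractor()
    tree = abstractor.visit(ast.parse(source))
    return ast.unparse(tree), abstractor.mapping


if __name__ == "__main__":
    code, names = abstract_names(
        "def add(a, b):\n    total = a + b\n    return total\n"
    )
    print(code)
    print(names)
```

Because every occurrence of a name maps to the same placeholder, the data-flow relationships between identifiers survive even though the surface names are removed.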


“…However, existing code pre-training approaches directly adopt (masked) language modeling as the training objective which targets on learning to predict (masked) tokens in a given code context (Feng et al, 2020; Guo et al, 2021; Wang et al, 2021b). However, this token-based approach generally results in poor code semantic representations due to two reasons.…”

Section: Introduction (mentioning)

confidence: 99%

CodeRetriever: Unimodal and Bimodal Contrastive Learning

Li, Gong, Shen et al. 2022

Preprint


In this paper, we propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations, specifically for the code search task. For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name. For bimodal contrastive learning, we leverage the documentation and inline comments of code to build text-code pairs. Both contrastive objectives can fully leverage the large-scale code corpus for pre-training. Experimental results on several public benchmarks (e.g., CodeSearch and CoSQA) demonstrate the effectiveness of CodeRetriever in the zero-shot setting. By fine-tuning with domain/language specified downstream data, CodeRetriever achieves new state-of-the-art performance with significant improvement over existing code pre-trained models. We will make the code, model checkpoint, and constructed datasets publicly available.
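To make the bimodal text-code objective more concrete, here is a minimal sketch of a generic in-batch contrastive (InfoNCE-style) loss in plain Python. This is a common formulation and an assumption on my part; the abstract does not state CodeRetriever's exact loss, and the function names and the `temperature` default are hypothetical. Each text embedding at index `i` is treated as matching the code embedding at index `i`, with the other code embeddings in the batch acting as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(text_vecs, code_vecs, temperature=0.05):
    """Mean cross-entropy of picking code i for text i among all codes."""
    losses = []
    for i, text in enumerate(text_vecs):
        logits = [cosine(text, code) / temperature for code in code_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(logit - peak) for logit in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    texts = [[1.0, 0.0], [0.0, 1.0]]
    aligned_codes = [[1.0, 0.0], [0.0, 1.0]]
    swapped_codes = [[0.0, 1.0], [1.0, 0.0]]
    print(in_batch_contrastive_loss(texts, aligned_codes))
    print(in_batch_contrastive_loss(texts, swapped_codes))
```

The loss is small when each text is closest to its own code and large when the pairs are mismatched, which is the signal such an objective would use to pull matching text-code pairs together.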
