2021
DOI: 10.48550/arxiv.2109.00859
Preprint

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Abstract: Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained…

Cited by 55 publications (103 citation statements)
References 19 publications (9 reference statements)
“…However, some pretrained models for Natural Languages (NL) like BERT [12] and GPT-3 [8] have recently demonstrated excellent transferability to Programming Languages (PL) and stronger capabilities of capturing semantics information than code2vec or code2seq. Inspired by the success of these language models, pre-trained models of code have recently become more and more popular in the field of code intelligence and benefited a broad range of tasks [9,14,20,26,42,46]. These current pre-trained models of code can be divided into two types: embedding models and generative models.…”
Section: Pre-trained Models of Code
confidence: 99%
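The embedding/generative split described in the statement above can be made concrete with a short sketch. The snippet below is a minimal illustration, assuming the HuggingFace transformers library and the public microsoft/codebert-base and Salesforce/codet5-base checkpoints; it is not taken from the cited works, and since the CodeT5 checkpoint here is only pre-trained (not fine-tuned for any generation task), its decoded output merely demonstrates the interface.

```python
import torch
from transformers import AutoModel, AutoTokenizer, T5ForConditionalGeneration

code = "def add(a, b):\n    return a + b"

# Embedding model (encoder-only): maps the snippet to a fixed-size vector,
# typically fed to retrieval, clone-detection, or classification heads.
emb_tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
emb_model = AutoModel.from_pretrained("microsoft/codebert-base")
with torch.no_grad():
    hidden = emb_model(**emb_tok(code, return_tensors="pt")).last_hidden_state
code_vector = hidden[:, 0]  # first-token ("CLS"-style) embedding of the snippet

# Generative model (encoder-decoder): maps code to an output sequence,
# e.g. a natural-language description or a transformed program.
gen_tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
gen_model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
with torch.no_grad():
    out_ids = gen_model.generate(
        gen_tok(code, return_tensors="pt").input_ids, max_length=32
    )

print(code_vector.shape)
print(gen_tok.decode(out_ids[0], skip_special_tokens=True))
```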
“…To probe the limits of successful transfer further, we next asked whether pretraining with programming languages, as opposed to natural languages, would also improve generalization in downstream semantic parsing tasks. To test this, we took a recent language model called CodeT5 (denoted by ct5 base in Tables 1-2), which was pretrained predominantly on several different programming languages (Wang et al., 2021). We note that the pretraining data for this model involved some amount of natural language as well, however, so the model was not pretrained exclusively with programming languages (for more details on the pretraining data and the pretraining tasks for this model, please see Wang et al. (2021)).…”
Section: Results
confidence: 99%
“…To test this, we took a recent language model called CodeT5 (denoted by ct5 base in Tables 1-2), which was pretrained predominantly on several different programming languages (Wang et al., 2021). We note that the pretraining data for this model involved some amount of natural language as well, however, so the model was not pretrained exclusively with programming languages (for more details on the pretraining data and the pretraining tasks for this model, please see Wang et al. (2021)). Remarkably, CodeT5 also substantially improved generalization in both SCAN and COGS (Tables 1-2).…”
Section: Results
confidence: 99%
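As a concrete illustration of the setup this citing work describes, the sketch below fine-tunes a code-pretrained encoder-decoder on a single SCAN-style command-to-action pair. It assumes the HuggingFace transformers package and the Salesforce/codet5-base checkpoint; the toy example and hyperparameters are illustrative assumptions, not the citing paper's actual training pipeline.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

command = "jump twice and walk left"          # natural-language command (input)
actions = "I_JUMP I_JUMP I_TURN_LEFT I_WALK"  # target action sequence (output)

inputs = tokenizer(command, return_tensors="pt")
labels = tokenizer(actions, return_tensors="pt").input_ids

# One supervised fine-tuning step: standard seq2seq cross-entropy loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()

# Generalization is then probed by generating on held-out commands.
with torch.no_grad():
    pred = model.generate(**inputs, max_length=32)
print(tokenizer.decode(pred[0], skip_special_tokens=True))
```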
“…The pretraining objectives used include masked language modeling, code structure edges, and representation alignment between source code and code structure. Other pretrained transformers used on source code include CodeT5 (Wang et al., 2021b), CodeTrans (Elnaggar et al., 2021), PyMT5 (Clement et al., 2020), CuBERT (Kanade et al., 2020), PLBART, ProphetNet-X (Qi et al., 2021), CoTexT (Phan et al., 2021), T5-Code (Mastropaolo et al., 2021), GraphCodeBERT, and AlphaCode (Li et al., 2022).…”
Section: Pretrained Transformer Models
confidence: 99%
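Several of the generative models listed in this statement, CodeT5 among them, build on a T5-style masked span prediction objective: contiguous spans of the input are replaced by sentinel tokens on the encoder side, and the decoder is trained to reconstruct the masked-out spans. The dependency-free sketch below shows how such input/target pairs can be built; the sentinel naming and span choices are illustrative assumptions, not any particular model's tokenizer.

```python
def span_corrupt(tokens, spans):
    """Build (encoder_input, decoder_target) for masked span denoising.

    tokens: list of input tokens, e.g. a tokenized code snippet.
    spans:  sorted, non-overlapping (start, end) index pairs to mask.
    """
    enc, dec, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        enc += tokens[cursor:start] + [sentinel]  # keep context, drop the span
        dec += [sentinel] + tokens[start:end]     # target reproduces the span
        cursor = end
    enc += tokens[cursor:]
    dec += [f"<extra_id_{len(spans)}>"]           # closing sentinel
    return enc, dec

code_tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
enc, dec = span_corrupt(code_tokens, [(1, 2), (8, 12)])
print(enc)  # ['def', '<extra_id_0>', '(', 'a', ',', 'b', ')', ':', '<extra_id_1>']
print(dec)  # ['<extra_id_0>', 'add', '<extra_id_1>', 'return', 'a', '+', 'b', '<extra_id_2>']
```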