2020
DOI: 10.48550/arxiv.2005.08025
Preprint

IntelliCode Compose: Code Generation Using Transformer

Abstract: In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, the majority of integrated development environments only support completion of methods and APIs, or arguments. In this paper, we introduce IntelliCode Compose, a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages state-of-the-art…
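As a rough illustration of the whole-line completion the abstract describes, the sketch below autoregressively extends a Python prompt with a decoder-only transformer and cuts the output at the end of the current line. The public GPT-2 checkpoint, the prompt, and the 20-token budget are illustrative assumptions; they are not the model, tokenizer, or data used by IntelliCode Compose.

```python
# Minimal sketch of whole-line code completion with a decoder-only
# transformer. GPT-2 is a stand-in; IntelliCode Compose trains its own
# generative transformer on a large multilingual source-code corpus.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "def read_lines(path):\n    with open(path) as f:\n        return "
inputs = tokenizer(prompt, return_tensors="pt")

# Greedily generate a short continuation, then keep only the current line,
# approximating an editor's "complete the rest of this line" behaviour.
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
completion = tokenizer.decode(new_tokens).split("\n")[0]
print(prompt + completion)
```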

Cited by 16 publications (33 citation statements)
References 19 publications
“…Pre-Trained Models for Programming Languages Inspired by the big success of pre-training in NLP (Devlin et al., 2018; Yang et al., 2019; Raffel et al., 2019), pre-trained models for programming languages have also promoted the development of code intelligence (Kanade et al., 2019; Feng et al., 2020; Karampatsis & Sutton, 2020; Svyatkovskiy et al., 2020; Buratti et al., 2020). Kanade et al. (2019) pre-train a BERT model on a massive corpus of Python source code with masked language modeling and next sentence prediction objectives.…”
Section: Related Work (mentioning)
confidence: 99%
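The masked language modeling objective this statement mentions can be illustrated with a short, self-contained sketch: hide one code token, ask the model to reconstruct it, and take the loss at that position. The generic bert-base-uncased checkpoint, the example line of code, and the chosen masked position are assumptions for illustration only; Kanade et al. pre-train their own BERT variant on a Python corpus.

```python
# Hedged sketch of masked language modeling on a line of code.
# bert-base-uncased is a stand-in for a code-specific BERT.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

code = "def add(a, b): return a + b"
enc = tokenizer(code, return_tensors="pt")

labels = torch.full_like(enc["input_ids"], -100)  # -100 = ignored by the loss
masked_pos = 5  # an arbitrary non-special position in this short example
labels[0, masked_pos] = enc["input_ids"][0, masked_pos]
enc["input_ids"][0, masked_pos] = tokenizer.mask_token_id

out = model(**enc, labels=labels)
print("MLM loss at the masked position:", float(out.loss))
print("Model's guess:", tokenizer.decode([out.logits[0, masked_pos].argmax().item()]))
```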
“…The success of pre-trained models in NLP also promotes the development of pre-trained models for programming languages. Existing works (Kanade et al., 2019; Karampatsis & Sutton, 2020; Feng et al., 2020; Svyatkovskiy et al., 2020; Buratti et al., 2020) regard source code as a sequence of tokens and pre-train models on source code to support code-related tasks such as code search, code completion, code summarization, etc. However, previous works only utilize source code for pre-training, while ignoring the inherent structure of code.…”
Section: Introduction (mentioning)
confidence: 99%
“…In recent years, several efforts have focused on the use of AI and machine learning techniques for various tasks related to software engineering, including code completion [15,41,75,84], code classification [49,68], API recommendation [16,33], variable and method naming [3,5], type inference [39,93], bug detection and repair [25,40,71,74,89,95], comment description and generation [4,44,48,65,80,91], code change summarization [66], and code clone detection [96]. A significant portion of this work is recounted in Allamanis et al.'s survey of the area [2].…”
Section: AI Techniques for Software Engineering (mentioning)
confidence: 99%
“…Svyatkovskiy et al. [15] introduced IntelliCode Compose, a general-purpose multilingual code completion tool capable of predicting code sequences of arbitrary token types. They do not leverage a high-level structural representation, such as the AST, and use subtokens to overcome the out-of-vocabulary problem.…”
Section: Related Work (mentioning)
confidence: 99%
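The subtoken strategy this statement refers to can be seen directly by running a subword tokenizer over a rare identifier. GPT-2's byte-level BPE tokenizer is used here only as an illustrative stand-in for the subtokenization scheme the paper describes, and the identifier is a made-up example.

```python
# Illustration of subtokenization as an out-of-vocabulary workaround: a rare
# identifier that is not a single vocabulary entry is split into several
# in-vocabulary subtokens and can be reassembled without loss.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

identifier = "parseHttpResponseHeadersAsync"  # hypothetical, unlikely to be one token
pieces = tokenizer.tokenize(identifier)
print(pieces)  # several short subtokens instead of a single unknown token

# Byte-level BPE is lossless, so joining the subtokens recovers the identifier.
assert tokenizer.convert_tokens_to_string(pieces) == identifier
```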
“…Although the performance of code completion techniques has substantially improved over time, the type of support they provide to developers has not evolved at the same pace: most techniques are only capable of predicting a single token. Only a few recent studies focus on predicting multiple contiguous tokens [14], [15].…”
Section: Introduction (mentioning)
confidence: 99%