Program Synthesis with Large Language Models

Austin, Jacob; Odena, Augustus; Nye, Maxwell; Bosma, Maarten; Michalewski, Henryk; Dohan, D.; Jiang, Ellen; Cai, Carrie J.; Terry, Michael; Le, Quoc V.; Sutton, Charles

doi:10.48550/arxiv.2108.07732

Cited by 95 publications

(240 citation statements)

References 0 publications

Supporting

Mentioning

207

Contrasting

Order By: Relevance

“…Left-to-Right Language Models (Figure 2, left) Auto-regressive, Left-to-right LMs, predict the probability of a token given the previous tokens. In code modeling, CodeGPT (124M) (Lu et al, 2021), CodeParrot (1.5B) (Tunstall et al, 2022), GPT-Neo (2.7B) (Black et al, 2021), GPT-J (6B) (Wang & Komatsuzaki, 2021), Codex (12B) (Chen et al, 2021), GPT-NeoX (20B) (Black et al, 2022), and Google's (137B) (Austin et al, 2021) belong to this category. The left-to-right nature of these models makes them highly useful for program generation tasks, such as code completion.…”

Section: Pretraining Methodsmentioning

confidence: 99%

“…These models excel at useful downstream tasks like code completion (Raychev et al, 2014) and synthesizing code from natural language descriptions (Desai et al, 2016). The current state-of-the-art large language models for code, such as Austin et al (2021), have shown significant progress for AI-based programming assistance. Most notably, one of the largest of these models, Codex (Chen et al, 2021) has been deployed in the real-world production tool GitHub Copilot 1 , as an in-IDE developer assistant that automatically generates code based on the user's context.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

A Systematic Evaluation of Large Language Models of Code

Xu¹,

Alon²,

Neubig³

et al. 2022

Preprint

View full text Add to dashboard Cite

Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing open-source models do achieve close results in some programming languages, although targeted mainly for natural language modeling. We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, that was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex. Our trained models are open-source and publicly available at https://github.com/VHellendoorn/Code-LMs, which enables future research and application in this area.

show abstract

Section: Pretraining Methodsmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

A Systematic Evaluation of Large Language Models of Code

Xu¹,

Alon²,

Neubig³

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…BLEU score is computed as the overlapping fraction of n-grams between the machine-generated text and the reference text. The metric has however been shown to not be a reliable measure for source code Allamanis et al, 2018;Austin et al, 2021). Computational Accuracy @ k (CA@k): Recent work in code synthesis has adopted the CA@k metric (Austin et al, 2021;Roziere et al, 2020) to evaluate code generation models.…”

Section: Evaluation Metricsmentioning

confidence: 99%

“…The metric has however been shown to not be a reliable measure for source code Allamanis et al, 2018;Austin et al, 2021). Computational Accuracy @ k (CA@k): Recent work in code synthesis has adopted the CA@k metric (Austin et al, 2021;Roziere et al, 2020) to evaluate code generation models. To compute CA@k, k samples are generated from the model, and the problem is considered solved if any of the generated k samples pass the unit tests associated with the problem.…”

Section: Evaluation Metricsmentioning

confidence: 99%

Using Document Similarity Methods to create Parallel Datasets for Code Translation

Agarwal¹,

Talamadupula²,

Martínez³

et al. 2021

Preprint

View full text Add to dashboard Cite

Translating source code from one programming language to another is a critical, time-consuming task in modernizing legacy applications and codebases. Recent work in this space has drawn inspiration from the software naturalness hypothesis by applying natural language processing techniques towards automating the code translation task. However, due to the paucity of parallel data in this domain, supervised techniques have only been applied to a limited set of popular programming languages. To bypass this limitation, unsupervised neural machine translation techniques have been proposed to learn code translation using only monolingual corpora. In this work, we propose to use document similarity methods to create noisy parallel datasets of code, thus enabling supervised techniques to be applied for automated code translation without having to rely on the availability or expensive curation of parallel code datasets. We explore the noise tolerance of models trained on such automatically-created datasets and show that these models perform comparably to models trained on ground truth for reasonable levels of noise. Finally, we exhibit the practical utility of the proposed method by creating parallel datasets for languages beyond the ones explored in prior work, thus expanding the set of programming languages for automated code translation.

show abstract

“…Neural models originally developed for natural language processing show promising performance for modeling computer programming languages (Chen et al, 2021;Austin et al, 2021). This has led to a number of interesting applications, and deep-learning models are now successfully and routinely applied in tools that assist developers in writing and understanding programs and code.…”

Section: Introductionmentioning

confidence: 99%

Towards Neural Functional Program Evaluation

Scholak¹,

Pilault²,

Velez-Ginorio³

2021

Preprint

View full text Add to dashboard Cite

This paper explores the capabilities of current transformer-based language models for program evaluation of simple functional programming languages. We introduce a new program generation mechanism that allows control over syntactic sugar for semantically equivalent programs. T5 experiments reveal that neural functional program evaluation performs surprisingly well, achieving high 90% exact program match scores for most in-distribution and out-of-distribution tests. Using pretrained T5 weights has significant advantages over random initialization. We present and evaluate on three datasets to study generalization abilities that are specific to functional programs based on: type, function composition, and reduction steps. Code and data are publicly available at https://github.com/ElementAI/neural-interpreters.

show abstract

Program Synthesis with Large Language Models

Cited by 95 publications

References 0 publications

A Systematic Evaluation of Large Language Models of Code

A Systematic Evaluation of Large Language Models of Code

Using Document Similarity Methods to create Parallel Datasets for Code Translation

Towards Neural Functional Program Evaluation

Contact Info

Product

Resources

About