Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022)
DOI: 10.1145/3540250.3549162
NatGen: generative pre-training by “naturalizing” source code

Cited by 43 publications (18 citation statements) · References 32 publications
“…While the optimized and obfuscated code provides precise and formal semantics [8], they tend to be unnatural, introducing data structures and variable names that are not commonly used in human-written programs. Existing studies have argued that such formal but unnatural programs are less favorable to human developers [9,10] and obstruct the code models' learning [11]. Also, ContraCode does not generate semantically contradicting programs as hard negative samples.…”
Section: Discussion (mentioning)
Confidence: 99%
“…Researchers have been passionate about pre-training Transformer models for source code. There are three main architectures for existing models: Encoder-only [6,7,20,30,37,45,75], Decoder-only [4,26,77], and Encoder-decoder [1,11,29,36,62]. Encoder-only models are commonly pretrained with cloze tasks (e.g., masked language model) and sequence understanding tasks (e.g., next statement prediction).…”
Section: Related Work (mentioning)
Confidence: 99%
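The cloze-style (masked language model) objective this excerpt refers to can be sketched in a few lines. The snippet below is a minimal, hedged illustration, assuming the Hugging Face transformers library and the public microsoft/codebert-base checkpoint; the code fragment being masked is an invented placeholder, not material from the cited works.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer(snippet, return_tensors="pt")

# Build cloze labels: predict only the masked positions, ignore the rest (-100).
labels = inputs["input_ids"].clone()
mask = torch.rand(labels.shape) < 0.15        # choose ~15% of positions at random
mask[:, 0] = False                            # keep the <s> special token intact
mask[:, -1] = False                           # keep the </s> special token intact
mask[0, 4] = True                             # ensure at least one masked position
labels[~mask] = -100
inputs["input_ids"][mask] = tokenizer.mask_token_id

# One pre-training step: the encoder reconstructs the original tokens.
loss = model(**inputs, labels=labels).loss
loss.backward()
print(f"cloze / MLM loss: {loss.item():.3f}")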
“…Pre-trained NLP models (e.g., BERT [18], RoBERTa [19], BART [20], T5 [21]) use different self-supervised pretraining objectives to learn robust language representations. NLP models have programming language counterparts (e.g., CodeBERT [25], GraphCodeBERT [26], PLBART [27], CodeT5 [13], NatGen [28]) where the models are initialized with the NLP models' weights and continued pre-training with code and associated natural language comments in most cases. Though root cause and mitigation are natural language descriptions, the vocabulary (e.g., identifiers) overlaps more with the comments used in code models.…”
Section: B. OpenAI Models and Baselines (mentioning)
Confidence: 99%
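The pattern this excerpt describes, starting from pre-trained weights and continuing training on code paired with natural-language comments, can be sketched as follows. This is a hedged sketch assuming the Hugging Face transformers library and the public Salesforce/codet5-base checkpoint; the comment/code pair is a made-up placeholder, not data from any of the cited works.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

# A (natural-language comment, code) pair; placeholder data for illustration.
comment = "Return the sum of two numbers."
code = "def add(a, b):\n    return a + b"

batch = tokenizer(comment, return_tensors="pt")
labels = tokenizer(code, return_tensors="pt").input_ids

# One supervised step: condition on the comment and learn to emit the code.
loss = model(**batch, labels=labels).loss
loss.backward()
print(f"seq2seq loss: {loss.item():.3f}")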
“…In previous studies (Ahmad et al 2021;Wang et al 2021b;Niu et al 2022;Chakraborty et al 2022), a prevalent approach is to directly fine-tune pre-trained language models (PLMs) to generate code. However, this approach has a severe limitation, i.e., the generated code may not follow the syntactic rules of the targeted programming language (Dong et al 2022), which can result in the failed compilation.…”
Section: Introduction (mentioning)
Confidence: 99%
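The limitation noted in this excerpt, that generated code may violate the target language's grammar, is typically detected with a parser before any compilation attempt. Below is a minimal sketch for Python output using only the standard library's ast module; both candidate snippets are invented for illustration.

import ast

# Hypothetical model outputs: one grammatical, one with a missing colon.
candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b)\n    return a + b",
]

for src in candidates:
    try:
        ast.parse(src)                         # raises SyntaxError on invalid code
        print("syntactically valid :", src.splitlines()[0])
    except SyntaxError as err:
        print("syntactically invalid:", src.splitlines()[0], f"({err.msg})")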