Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022)
DOI: 10.1145/3540250.3549162
NatGen: generative pre-training by “naturalizing” source code

Cited by 43 publications (18 citation statements) · References 32 publications
“…While the optimized and obfuscated code provides precise and formal semantics [8], they tend to be unnatural, introducing data structures and variable names that are not commonly used in human-written programs. Existing studies have argued that such formal but unnatural programs are less favorable to human developers [9,10] and obstruct the code models' learning [11]. Also, ContraCode does not generate semantically contradicting programs as hard negative samples.…”
Section: Discussion (mentioning)
Confidence: 99%
“…Researchers have been passionate about pre-training Transformer models for source code. There are three main architectures for existing models: Encoder-only [6,7,20,30,37,45,75], Decoder-only [4,26,77], and Encoder-decoder [1,11,29,36,62]. Encoder-only models are commonly pretrained with cloze tasks (e.g., masked language model) and sequence understanding tasks (e.g., next statement prediction).…”
Section: Related Work (mentioning)
Confidence: 99%
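The cloze-style (masked language model) objective this excerpt refers to can be sketched in a few lines. The snippet below is a minimal, hedged illustration, assuming the Hugging Face transformers library and the public microsoft/codebert-base checkpoint; the code fragment being masked is an invented placeholder, not material from the cited works.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer(snippet, return_tensors="pt")

# Build cloze labels: predict only the masked positions, ignore the rest (-100).
labels = inputs["input_ids"].clone()
mask = torch.rand(labels.shape) < 0.15        # choose ~15% of positions at random
mask[:, 0] = False                            # keep the <s> special token intact
mask[:, -1] = False                           # keep the </s> special token intact
mask[0, 4] = True                             # ensure at least one masked position
labels[~mask] = -100
inputs["input_ids"][mask] = tokenizer.mask_token_id

# One pre-training step: the encoder reconstructs the original tokens.
loss = model(**inputs, labels=labels).loss
loss.backward()
print(f"cloze / MLM loss: {loss.item():.3f}")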
“…Pre-trained NLP models (e.g., BERT [18], RoBERTa [19], BART [20], T5 [21]) use different self-supervised pretraining objectives to learn robust language representations. NLP models have programming language counterparts (e.g., CodeBERT [25], GraphCodeBERT [26], PLBART [27], CodeT5 [13], NatGen [28]) where the models are initialized with the NLP models' weights and continued pre-training with code and associated natural language comments in most cases. Though root cause and mitigation are natural language descriptions, the vocabulary (e.g., identifiers) overlaps more with the comments used in code models.…”
Section: B. OpenAI Models and Baselines (mentioning)
Confidence: 99%
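The pattern this excerpt describes, starting from pre-trained weights and continuing training on code paired with natural-language comments, can be sketched as follows. This is a hedged sketch assuming the Hugging Face transformers library and the public Salesforce/codet5-base checkpoint; the comment/code pair is a made-up placeholder, not data from any of the cited works.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

# A (natural-language comment, code) pair; placeholder data for illustration.
comment = "Return the sum of two numbers."
code = "def add(a, b):\n    return a + b"

batch = tokenizer(comment, return_tensors="pt")
labels = tokenizer(code, return_tensors="pt").input_ids

# One supervised step: condition on the comment and learn to emit the code.
loss = model(**batch, labels=labels).loss
loss.backward()
print(f"seq2seq loss: {loss.item():.3f}")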
“…In previous studies (Ahmad et al 2021;Wang et al 2021b;Niu et al 2022;Chakraborty et al 2022), a prevalent approach is to directly fine-tune pre-trained language models (PLMs) to generate code. However, this approach has a severe limitation, i.e., the generated code may not follow the syntactic rules of the targeted programming language (Dong et al 2022), which can result in the failed compilation.…”
Section: Introduction (mentioning)
Confidence: 99%
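The limitation noted in this excerpt, that generated code may violate the target language's grammar, is typically detected with a parser before any compilation attempt. Below is a minimal sketch for Python output using only the standard library's ast module; both candidate snippets are invented for illustration.

import ast

# Hypothetical model outputs: one grammatical, one with a missing colon.
candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b)\n    return a + b",
]

for src in candidates:
    try:
        ast.parse(src)                         # raises SyntaxError on invalid code
        print("syntactically valid :", src.splitlines()[0])
    except SyntaxError as err:
        print("syntactically invalid:", src.splitlines()[0], f"({err.msg})")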