2020
DOI: 10.48550/arxiv.2003.07914
Preprint
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, et al.

Abstract: Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading t…
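The problem the abstract describes — identifier names proliferating faster than any fixed vocabulary can track — is what open-vocabulary, subword-based modeling addresses: tokens are represented as sequences of smaller units, so an identifier never seen during training still maps to known pieces rather than an <UNK> token. The snippet below is a minimal, self-contained sketch of BPE-style subword segmentation; the merge table, identifiers, and function name are toy illustrations, not the paper's implementation or data.

```python
# Minimal sketch (not the paper's implementation): how a subword (BPE-style)
# vocabulary keeps the model open -- an identifier unseen during training is
# still representable as a sequence of known subword units, so no <UNK> token
# is needed. The merge table and identifiers below are toy examples.

TOY_MERGES = [
    ("f", "i"), ("fi", "l"), ("fil", "e"),   # builds the unit "file"
    ("r", "e"), ("re", "a"), ("rea", "d"),   # builds the unit "read"
    ("e", "r"),                              # builds the unit "er"
]

def bpe_segment(token, merges):
    """Greedily apply learned merges, in order, starting from single characters."""
    pieces = list(token)
    for left, right in merges:
        merged, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces

print(bpe_segment("filereader", TOY_MERGES))  # ['file', 'read', 'er']
print(bpe_segment("logreader", TOY_MERGES))   # ['l', 'o', 'g', 'read', 'er'] -- no <UNK>
```

In practice the merges are learned from a large code corpus; the point of the toy version is only that segmentation falls back to single characters in the worst case, which is what makes the vocabulary effectively open.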


Cited by 11 publications (19 citation statements)
References 42 publications
“…While Bengio et al [11] reported worse results when using this estimator compared to straight-through, we found the latter to fail on our task. 4: Relation prediction model produces high quality edges on the validation and test splits of the Python corpus [37]. The model has an F score above 0.8 for all edge types, and attains an F score close to 1.0 for the most frequently occurring types.…”
Section: Backpropagating Through Predicted Relations (mentioning)
confidence: 99%
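The statement above contrasts two gradient estimators for non-differentiable edge decisions. "Straight-through" commonly refers to the estimator popularized by Bengio et al.: make a hard discretization in the forward pass but backpropagate through it as if it were the identity. The PyTorch snippet below is a generic sketch of that trick, not the citing paper's relation-prediction model; the binarization threshold and tensor shapes are illustrative assumptions.

```python
# Generic sketch of the straight-through estimator (not the citing paper's model):
# the forward pass makes a hard, non-differentiable 0/1 decision, while the
# backward pass treats the rounding as the identity so gradients reach the logits.
import torch

def straight_through_binarize(logits):
    probs = torch.sigmoid(logits)
    hard = (probs > 0.5).float()          # non-differentiable step
    # Equals `hard` in the forward pass, but gradients flow through `probs`.
    return (hard - probs).detach() + probs

logits = torch.randn(4, requires_grad=True)
edges = straight_through_binarize(logits)  # hard 0/1 decisions (e.g. predicted edges)
edges.sum().backward()                     # gradient reaches `logits` despite the step
print(edges, logits.grad)
```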
“…We adopted the Python source code dataset used by Karampatsis and Sutton [36] that is comprised of two training splits (one small and one large), a validation, a test, and an encoding split used only for learning a tokenization. The corpus sanitized by Karampatsis et al [37] is licensed under CC BY 4.0. Dataset Preprocessing.…”
Section: Code Completion (mentioning)
confidence: 99%