2020
DOI: 10.48550/arxiv.2003.07914
Preprint
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, et al.

Abstract: Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading t…
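The problem the abstract describes — identifier names proliferating faster than any fixed vocabulary can track — is what open-vocabulary, subword-based modeling addresses: tokens are represented as sequences of smaller units, so an identifier never seen during training still maps to known pieces rather than an <UNK> token. The snippet below is a minimal, self-contained sketch of BPE-style subword segmentation; the merge table, identifiers, and function name are toy illustrations, not the paper's implementation or data.

```python
# Minimal sketch (not the paper's implementation): how a subword (BPE-style)
# vocabulary keeps the model open -- an identifier unseen during training is
# still representable as a sequence of known subword units, so no <UNK> token
# is needed. The merge table and identifiers below are toy examples.

TOY_MERGES = [
    ("f", "i"), ("fi", "l"), ("fil", "e"),   # builds the unit "file"
    ("r", "e"), ("re", "a"), ("rea", "d"),   # builds the unit "read"
    ("e", "r"),                              # builds the unit "er"
]

def bpe_segment(token, merges):
    """Greedily apply learned merges, in order, starting from single characters."""
    pieces = list(token)
    for left, right in merges:
        merged, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces

print(bpe_segment("filereader", TOY_MERGES))  # ['file', 'read', 'er']
print(bpe_segment("logreader", TOY_MERGES))   # ['l', 'o', 'g', 'read', 'er'] -- no <UNK>
```

In practice the merges are learned from a large code corpus; the point of the toy version is only that segmentation falls back to single characters in the worst case, which is what makes the vocabulary effectively open.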


Cited by 11 publications (19 citation statements)
References 42 publications
“…While Bengio et al [11] reported worse results when using this estimator compared to straight-through, we found the latter to fail on our task. 4: Relation prediction model produces high quality edges on the validation and test splits of the Python corpus [37]. The model has an F score above 0.8 for all edge types, and attains an F score close to 1.0 for the most frequently occurring types.…”
Section: Backpropagating Through Predicted Relations (mentioning)
confidence: 99%
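The statement above contrasts two gradient estimators for non-differentiable edge decisions. "Straight-through" commonly refers to the estimator popularized by Bengio et al.: make a hard discretization in the forward pass but backpropagate through it as if it were the identity. The PyTorch snippet below is a generic sketch of that trick, not the citing paper's relation-prediction model; the binarization threshold and tensor shapes are illustrative assumptions.

```python
# Generic sketch of the straight-through estimator (not the citing paper's model):
# the forward pass makes a hard, non-differentiable 0/1 decision, while the
# backward pass treats the rounding as the identity so gradients reach the logits.
import torch

def straight_through_binarize(logits):
    probs = torch.sigmoid(logits)
    hard = (probs > 0.5).float()          # non-differentiable step
    # Equals `hard` in the forward pass, but gradients flow through `probs`.
    return (hard - probs).detach() + probs

logits = torch.randn(4, requires_grad=True)
edges = straight_through_binarize(logits)  # hard 0/1 decisions (e.g. predicted edges)
edges.sum().backward()                     # gradient reaches `logits` despite the step
print(edges, logits.grad)
```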
“…We adopted the Python source code dataset used by Karampatsis and Sutton [36] that is comprised of two training splits (one small and one large), a validation, a test, and an encoding split used only for learning a tokenization. The corpus sanitized by Karampatsis et al [37] is licensed under CC BY 4.0. Dataset Preprocessing.…”
Section: Code Completion (mentioning)
confidence: 99%