2013 10th Working Conference on Mining Software Repositories (MSR)
DOI: 10.1109/msr.2013.6624029
Mining source code repositories at massive scale using language modeling

Abstract: The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle et al. …
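The object at the heart of the abstract is a token-level n-gram language model trained on lexed Java. As a minimal sketch of that idea (not the authors' implementation; a model trained on a giga-token corpus would additionally need smoothing and compact storage), a maximum-likelihood trigram model can be built from simple counts:

```python
# Minimal sketch of a token-level trigram language model built from
# maximum-likelihood counts. Illustrative only, not the paper's pipeline.
from collections import Counter, defaultdict

def train_trigram_model(tokens):
    """Count continuations so that P(w3 | w1, w2) = c(w1 w2 w3) / c(w1 w2)."""
    counts = defaultdict(Counter)
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def probability(counts, w1, w2, w3):
    context = counts.get((w1, w2))
    if not context:
        return 0.0  # unseen context; a real model would smooth or back off here
    return context[w3] / sum(context.values())

# Toy "corpus" of Java-like tokens (assume a lexer produced the stream).
model = train_trigram_model("public static void main ( String [ ] args )".split())
print(probability(model, "static", "void", "main"))  # 1.0 on this toy stream
```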

Cited by 236 publications (266 citation statements: 3 supporting, 263 mentioning, 0 contrasting)
References 10 publications
“…In selecting projects for our dataset, we scanned (in no particular order or priority) the JAVA projects in GitHub's Trending repositories 7 and the list provided by the GitHub JAVA Project due to Charles and Allamanis [6] 8, selecting projects that match the above criteria. Tab.…”
Section: The Dataset (mentioning)
Confidence: 99%
“…Plus, it does not require static analysis, which would be impossible in dynamic languages such as Python and JavaScript. Allamanis and Sutton [6] found that, with a much larger model trained on over one billion tokens, the n-gram language model performed "significantly better at the code suggestion task than previous models". This implies that all that is needed to generate better suggestions, and to mitigate the loss of accuracy when predicting across project domains, is to increase the corpus size.…”
Section: Methods (mentioning)
Confidence: 99%
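To make the excerpt's point concrete, here is a hedged sketch (not the evaluation setup of [6] or of the citing paper) of how an n-gram model yields code suggestions: candidate next tokens are ranked by how often they followed the most recent context in the training corpus, so a larger corpus directly means fewer unseen contexts and better-ranked completions.

```python
# Hedged sketch of n-gram code suggestion: rank candidate next tokens by
# their frequency after the two most recently typed tokens.
from collections import Counter, defaultdict

def train(tokens):
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def suggest(counts, w1, w2, k=3):
    """Top-k completions for context (w1, w2); empty if the context is unseen."""
    context = counts.get((w1, w2))
    return [tok for tok, _ in context.most_common(k)] if context else []

counts = train("for ( int i = 0 ; i < n ; i ++ )".split())
print(suggest(counts, "i", "++"))  # [')']; a bigger corpus covers more contexts
```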
“…Between the time I pushed the first commit to atom-gamboge 6 in October 2014 and the time of this writing, Atom has been through numerous releases (https://github.com/atom/atom/releases), with several API-breaking changes. The documentation for Atom has been wanting, at best.…”
Section: Retrospective (mentioning)
Confidence: 99%
“…This minimizes the size of the vocabulary, but leads to long sequences from which structure is harder to extract. In [18], a token-based vocabulary is used. This leads to shorter sequences, but naive tokenization causes an explosion in the size of the vocabulary, as every identifier and literal must be represented uniquely.…”
Section: Language Model (mentioning)
Confidence: 99%
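The vocabulary explosion the excerpt mentions is easy to demonstrate. The sketch below shows one common mitigation, collapsing literals into placeholder tokens; this is an illustrative assumption, not necessarily what [18] does.

```python
# Hedged sketch: naive token vocabularies grow with every distinct literal,
# while normalizing literals to placeholders keeps the vocabulary bounded.
import re

def normalize(token):
    """Collapse numeric and string literals to placeholder types."""
    if re.fullmatch(r"\d+(\.\d+)?", token):
        return "<NUM>"
    if token.startswith('"') and token.endswith('"') and len(token) >= 2:
        return "<STR>"
    return token

raw = 'x = 1 ; y = 22 ; s = "a" ; t = "bb" ;'.split()
print(len(set(raw)))                         # 10 types: each literal is unique
print(len(set(normalize(t) for t in raw)))   # 8 types: literals share <NUM>/<STR>
```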