2013 10th Working Conference on Mining Software Repositories (MSR)
DOI: 10.1109/msr.2013.6624029
Mining source code repositories at massive scale using language modeling

Abstract: The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle et al. …
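The object at the heart of the abstract is a token-level n-gram language model trained on lexed Java. As a minimal sketch of that idea (not the authors' implementation; a model trained on a giga-token corpus would additionally need smoothing and compact storage), a maximum-likelihood trigram model can be built from simple counts:

```python
# Minimal sketch of a token-level trigram language model built from
# maximum-likelihood counts. Illustrative only, not the paper's pipeline.
from collections import Counter, defaultdict

def train_trigram_model(tokens):
    """Count continuations so that P(w3 | w1, w2) = c(w1 w2 w3) / c(w1 w2)."""
    counts = defaultdict(Counter)
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def probability(counts, w1, w2, w3):
    context = counts.get((w1, w2))
    if not context:
        return 0.0  # unseen context; a real model would smooth or back off here
    return context[w3] / sum(context.values())

# Toy "corpus" of Java-like tokens (assume a lexer produced the stream).
model = train_trigram_model("public static void main ( String [ ] args )".split())
print(probability(model, "static", "void", "main"))  # 1.0 on this toy stream
```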

Cited by 236 publications (266 citation statements: 3 supporting, 263 mentioning, 0 contrasting)
References 10 publications
“…In selecting projects for our dataset, we scanned (in no particular order or priority) the JAVA projects in GitHub's Trending repositories 7 and the list provided by the GitHub JAVA Project due to Charles and Allamanis [6] 8, selecting projects that match the above criteria. Tab.…”
Section: The Dataset (mentioning)
Confidence: 99%
“…Plus, it does not require static analysis, which would be impossible in dynamic languages such as Python and JavaScript. Allamanis and Sutton [6] found that, with a much larger model trained on over one billion tokens, the n-gram language model performed "significantly better at the code suggestion task than previous models". This implies that all that is needed to generate better suggestions, and to mitigate the loss of accuracy when predicting across project domains, is to increase the corpus size.…”
Section: Methods (mentioning)
Confidence: 99%
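To make the excerpt's point concrete, here is a hedged sketch (not the evaluation setup of [6] or of the citing paper) of how an n-gram model yields code suggestions: candidate next tokens are ranked by how often they followed the most recent context in the training corpus, so a larger corpus directly means fewer unseen contexts and better-ranked completions.

```python
# Hedged sketch of n-gram code suggestion: rank candidate next tokens by
# their frequency after the two most recently typed tokens.
from collections import Counter, defaultdict

def train(tokens):
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def suggest(counts, w1, w2, k=3):
    """Top-k completions for context (w1, w2); empty if the context is unseen."""
    context = counts.get((w1, w2))
    return [tok for tok, _ in context.most_common(k)] if context else []

counts = train("for ( int i = 0 ; i < n ; i ++ )".split())
print(suggest(counts, "i", "++"))  # [')']; a bigger corpus covers more contexts
```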
“…Between the time I pushed the first commit to atom-gamboge 6 in October 2014 and the time of this writing, Atom has been through numerous releases (https://github.com/atom/atom/releases), with several API-breaking changes. The documentation for Atom has been wanting, at best.…”
Section: Retrospective (mentioning)
Confidence: 99%
“…This minimizes the size of the vocabulary, but leads to long sequences from which structure is harder to extract. In [18], a token-based vocabulary is used. This leads to shorter sequences, but naive tokenization causes an explosion in the size of the vocabulary, as every identifier and literal must be represented uniquely.…”
Section: Language Model (mentioning)
Confidence: 99%
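The vocabulary explosion the excerpt mentions is easy to demonstrate. The sketch below shows one common mitigation, collapsing literals into placeholder tokens; this is an illustrative assumption, not necessarily what [18] does.

```python
# Hedged sketch: naive token vocabularies grow with every distinct literal,
# while normalizing literals to placeholders keeps the vocabulary bounded.
import re

def normalize(token):
    """Collapse numeric and string literals to placeholder types."""
    if re.fullmatch(r"\d+(\.\d+)?", token):
        return "<NUM>"
    if token.startswith('"') and token.endswith('"') and len(token) >= 2:
        return "<STR>"
    return token

raw = 'x = 1 ; y = 22 ; s = "a" ; t = "bb" ;'.split()
print(len(set(raw)))                         # 10 types: each literal is unique
print(len(set(normalize(t) for t in raw)))   # 8 types: literals share <NUM>/<STR>
```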