2012 34th International Conference on Software Engineering (ICSE) 2012
DOI: 10.1109/icse.2012.6227135
On the naturalness of software

Abstract: Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of stati…

Cited by 592 publications (659 citation statements)
References 36 publications
“…In those areas, they are used for ranking candidate sentences, such as candidate translations of a foreign language sentence, based on how natural they are in the target language. To our knowledge, Hindle et al [1] were the first to apply language models to source code.…”
Section: Language Models for Programming Languages
confidence: 99%
“…Recently, Hindle et al [1] presented pioneering work in learning language models over source code, that represent broad statistical characteristics of coding style. Language models (LMs) are simply probability distributions over strings.…”
Section: Introduction
confidence: 99%
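The excerpt above characterizes language models as probability distributions over strings. As a concrete illustration only (not the models used in the paper), a minimal bigram model with add-k smoothing over tokenized code could look like this; the toy corpus and smoothing constant are invented for the sketch:

```python
from collections import Counter, defaultdict

def train_bigram_lm(token_lists, smoothing=0.1):
    """Train an add-k-smoothed bigram model over tokenized code.

    Returns a function prob(prev, cur) giving P(cur | prev).
    """
    bigrams = defaultdict(Counter)
    vocab = set()
    for tokens in token_lists:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[prev][cur] += 1
    vocab_size = len(vocab)

    def prob(prev, cur):
        counts = bigrams[prev]
        return (counts[cur] + smoothing) / (sum(counts.values()) + smoothing * vocab_size)

    return prob

# Tiny invented "corpus" of tokenized code lines:
corpus = [["for", "i", "in", "range", "(", "n", ")", ":"],
          ["for", "x", "in", "xs", ":"]]
p = train_bigram_lm(corpus)

# A bigram seen in training scores higher than an unseen one:
assert p("for", "i") > p("for", "return")
```

Such a model assigns higher probability to token sequences resembling its training corpus, which is what lets it rank candidate completions or translations by "naturalness."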
“…While there are many ways in which tokens in a block can be ordered, e.g., alphabetical order, length of tokens, occurrence frequency of a token in a corpus, etc., a natural question is what order is most effective in this context. As it turns out, software vocabulary exhibits very similar characteristics to a natural-language corpus and also follows Zipf's law [13,40]. That is, there are a few very popular (frequent) tokens, and the frequency of tokens decreases very rapidly with rank.…”
Section: Sub-Block Overlap Filtering
confidence: 99%
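The excerpt above asks how tokens in a block should be ordered for overlap filtering, given Zipf-distributed frequencies. One plausible sketch, with entirely hypothetical corpus frequencies, is to sort a block's tokens rarest-first, so that comparisons between candidate clone blocks encounter the most discriminative tokens early:

```python
from collections import Counter

# Hypothetical, Zipf-like corpus frequencies: a few very frequent
# punctuation/keyword tokens, many rare identifiers.
corpus_freq = Counter({"(": 500, ")": 500, "=": 320, "if": 150,
                       "return": 90, "parseConfig": 3, "retryDelay": 2})

def order_block(tokens, freq):
    """Order a block's tokens by ascending corpus frequency (rarest first),
    breaking ties lexicographically for determinism."""
    return sorted(tokens, key=lambda t: (freq[t], t))

block = ["if", "(", "retryDelay", "=", "parseConfig", "(", ")", ")"]
ordered = order_block(block, corpus_freq)
# Rare identifiers lead; ubiquitous punctuation trails.
assert ordered[0] == "retryDelay"
```

Because rare tokens are the least likely to be shared by chance, putting them first lets a filter reject non-matching block pairs after inspecting only a short prefix.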
“…The mental model of the programmer may be something like a language model for speech, but rather applied to code. Language models are typically applied to natural human utterances but they have also been successfully applied to software (Hindle et al., 2012; Raychev et al., 2014; White et al., 2015), and can be used to discover unexpected segments of tokens in source code (Campbell et al., 2014).…”
Section: Introduction
confidence: 99%
“…Thus GrammarGuru uses language models to capture code regularity or naturalness and then looks for irregular code (Campbell et al., 2014). Once the location of a potential error is found, code completion techniques that exploit language models (Hindle et al., 2012; Raychev et al., 2014; White et al., 2015) can be used to suggest possible fixes. Traditional parsers do not rely upon such information.…”
Section: Introduction
confidence: 99%
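The last two excerpts describe using language models first to flag irregular code and then to suggest fixes. A hedged sketch of the flagging step, using per-token surprisal (negative log probability) under a bigram model — the training corpus, suspect line, and smoothing constants are invented for illustration and are not the cited systems:

```python
import math
from collections import Counter, defaultdict

def bigram_counts(token_lists):
    """Count bigrams over tokenized code, with start/end padding."""
    counts = defaultdict(Counter)
    for toks in token_lists:
        for prev, cur in zip(["<s>"] + toks, toks + ["</s>"]):
            counts[prev][cur] += 1
    return counts

def surprisal(counts, prev, cur, k=0.01, vocab_size=1000):
    """Surprisal in bits of seeing `cur` after `prev` (add-k smoothed)."""
    c = counts[prev]
    p = (c[cur] + k) / (sum(c.values()) + k * vocab_size)
    return -math.log2(p)

# Train on "regular" code, then score a suspect line token by token;
# the highest-surprisal token marks the likeliest error site.
train = [["if", "(", "x", ")", "{", "}"]] * 50
counts = bigram_counts(train)

suspect = ["if", "(", "x", "{", "}"]   # missing ")"
scores = [(cur, surprisal(counts, prev, cur))
          for prev, cur in zip(["<s>"] + suspect, suspect)]
worst = max(scores, key=lambda s: s[1])[0]
# "{" directly after "x" never occurs in training, so it is flagged:
assert worst == "{"
```

Once the surprising position is located, the same model's most probable continuations at that point (here, ")") can serve as candidate fixes, which is the completion-based repair idea the excerpt attributes to the cited work.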