Proceedings of the 38th International Conference on Software Engineering 2016
DOI: 10.1145/2884781.2884848
On the "naturalness" of buggy code

Abstract: Real software, the kind working programmers produce by the kLOC to solve real-world problems, tends to be "natural", like speech or natural language; it tends to be highly repetitive and predictable. Researchers have captured this naturalness of software through statistical models and used them to good effect in suggestion engines, porting tools, coding standards checkers, and idiom miners. This suggests that code that appears improbable, or surprising, to a good statistical language model is "unnatural" in so…
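The abstract's core idea is that a statistical language model trained on a code corpus assigns low probability (high cross-entropy) to "unnatural" code. A minimal sketch of that scoring, using a bigram model with add-one smoothing over hypothetical token streams (the paper itself uses richer n-gram and cache models; everything below is illustrative, not the authors' implementation):

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized code lines."""
    uni, bi = Counter(), Counter()
    for tokens in corpus:
        padded = ["<s>"] + tokens
        uni.update(padded)
        bi.update(zip(padded, padded[1:]))
    return uni, bi

def cross_entropy(tokens, uni, bi, vocab_size):
    """Average negative log2 probability per token, add-one smoothed.
    Higher values mean the code is more 'surprising' (less natural)."""
    padded = ["<s>"] + tokens
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bi[(prev, cur)] + 1) / (uni[prev] + vocab_size)
        logp += -math.log2(p)
    return logp / len(tokens)

# Hypothetical training corpus of tokenized statements
corpus = [["if", "(", "x", "==", "null", ")"],
          ["if", "(", "y", "==", "null", ")"],
          ["if", "(", "x", "==", "0", ")"]]
uni, bi = train_bigram(corpus)
V = len(uni)

common = cross_entropy(["if", "(", "y", "==", "0", ")"], uni, bi, V)
odd = cross_entropy(["if", "(", "==", "x", ")", "null"], uni, bi, V)
assert odd > common  # the scrambled line scores as less natural
```

Ranking lines (or diffs) by such an entropy score is what lets "surprising" code surface as a bug-likelihood signal.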

Cited by 197 publications (144 citation statements). References 68 publications.
“…Even though it has been noted that buggy code stands out compared to non-buggy code [Ray et al 2016], little work exists on automatically detecting bugs via machine learning. Murali et al train a recurrent neural network that probabilistically models sequences of API calls and then use it for finding incorrect API usages [Murali et al 2017].…”
Section: Machine Learning and Language Models for Analyzing Bugs
confidence: 99%
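The Murali et al. approach quoted above learns a probability distribution over API call sequences and flags low-probability sequences as likely misuse. As a rough illustration of the idea, a first-order Markov model over call names can stand in for their recurrent network (all call names and traces here are hypothetical, and this is not their implementation):

```python
from collections import Counter, defaultdict

# Hypothetical traces of correct API usage
traces = [["open", "read", "close"],
          ["open", "write", "close"],
          ["open", "read", "read", "close"]]

# First-order transition counts: a crude stand-in for the RNN's
# learned distribution over call sequences
trans = defaultdict(Counter)
for t in traces:
    for prev, cur in zip(["<s>"] + t, t):
        trans[prev][cur] += 1

def seq_prob(calls, eps=1e-6):
    """Product of transition probabilities; eps for unseen transitions."""
    p = 1.0
    for prev, cur in zip(["<s>"] + calls, calls):
        total = sum(trans[prev].values())
        p *= trans[prev][cur] / total if total and trans[prev][cur] else eps
    return p

ok = seq_prob(["open", "read", "close"])
bad = seq_prob(["read", "open", "close"])  # read before open
assert bad < ok  # the out-of-order usage is flagged as improbable
```

An RNN generalizes this by conditioning each call on the whole prefix rather than just the previous call, but the detection principle, low sequence probability implies suspect usage, is the same.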
“…We study simplicity since it is very useful to replace N methods with M ≪ N methods, especially when the results from the many are no better than the few. A bewildering array of new methods for software quality prediction are reported each year (some of which rely on intimidatingly complex mathematical methods) such as deep belief net learning [50], spectral-based clustering [55], and n-gram language models [46]. Ghotra et al list dozens of different data mining algorithms that might be used for defect predictors [16].…”
Section: Background and Related Work 2.1 Why Study Simplification?
confidence: 99%
“…This is done by aggregating those outputs from the descendants, i.e. calling t-lstm() recursively on the children nodes (lines 11-26). This function first obtains the embedding w_t of the input AST node t (using ast2vec as discussed in Section 4.2).…”
Section: Defect Prediction Model
confidence: 99%
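The recursion the quote describes, aggregate the children's states bottom-up, then fold in the current node's embedding w_t, can be sketched as follows. A real child-sum tree-LSTM combines child states through gated updates; here the children are simply summed, since the point is the recursion shape. Node labels, the embedding table, and the tree are all hypothetical:

```python
# Toy embedding table standing in for the ast2vec lookup in the quote
EMBED = {"if": [1.0, 0.0], "call": [0.0, 1.0], "name": [0.5, 0.5]}

class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def t_lstm(node):
    """Return a vector state for `node`: recurse over the children
    (the 'aggregate descendants' step), then add the node's own
    embedding w_t. Gates of the real tree-LSTM are omitted."""
    child_state = [0.0, 0.0]
    for c in node.children:            # recursive descent over the AST
        h = t_lstm(c)
        child_state = [a + b for a, b in zip(child_state, h)]
    w_t = EMBED[node.label]
    return [a + b for a, b in zip(child_state, w_t)]

# A tiny hypothetical AST: if-node with a call (wrapping a name) and a name
ast = Node("if", [Node("call", [Node("name")]), Node("name")])
state = t_lstm(ast)
assert state == [2.0, 2.0]  # sum of all four node embeddings
```

The root's state summarizes the whole subtree and is what a downstream defect-prediction classifier would consume.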
“…[8]) and the line level (e.g. [24]). Since our approach is able to learn features at the code token level, it may work at those finer levels of granularity.…”
Section: Related Work 7.1 Defect Prediction
confidence: 99%