PathPair2Vec: An AST path pair-based code representation method for defect prediction

Shi, Ke; Chang, Jingfei; Wei, Zhen

doi:10.1016/j.cola.2020.100979

Cited by 36 publications

(33 citation statements)

References 11 publications


“…Then, a bi-directional network with LSTM is used as model. In [22], the authors propose a model for defect prediction on the base of AST path pair representation. To process the code, the path in the AST is extracted as combination of symbol sequence and control sequence.…”

Section: Long Short Term Memory (mentioning)

confidence: 99%
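
To make the idea of AST path extraction in the statement above concrete, here is a minimal sketch of code2vec-style path contexts (terminal, path, terminal). It is not the PathPair2Vec extraction itself: the split into symbol and control sequences is not reproduced, Python's standard `ast` module stands in for whatever parser the authors used, and the function names `leaf_trails`, `path_between` and `path_contexts`, as well as the `^` (up) and `_` (down) separators, are assumptions made for illustration.

```python
import ast
from itertools import combinations


def leaf_trails(tree):
    """Collect (token, root-to-terminal node list) for every Name/Constant."""
    leaves = []

    def walk(node, trail):
        trail = trail + [node]
        if isinstance(node, ast.Name):
            leaves.append((node.id, trail))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), trail))
        for child in ast.iter_child_nodes(node):
            walk(child, trail)

    walk(tree, [])
    return leaves


def path_between(trail_a, trail_b):
    """Node-type path from terminal A up to the lowest common ancestor, then down to B."""
    i = 0
    while i < min(len(trail_a), len(trail_b)) and trail_a[i] is trail_b[i]:
        i += 1
    up = [type(n).__name__ for n in reversed(trail_a[i:])]
    top = type(trail_a[i - 1]).__name__  # lowest common ancestor
    down = [type(n).__name__ for n in trail_b[i:]]
    return "^".join(up + [top]) + "".join("_" + d for d in down)


def path_contexts(source):
    """Return (terminal, path, terminal) triples for every pair of terminals."""
    leaves = leaf_trails(ast.parse(source))
    return [
        (tok_a, path_between(trail_a, trail_b), tok_b)
        for (tok_a, trail_a), (tok_b, trail_b) in combinations(leaves, 2)
    ]


contexts = path_contexts("x = a + 1")
for context in contexts:
    print(context)
```

In practice, path-based models usually cap path length and the number of sampled pairs per method, since the number of terminal pairs grows quadratically; this sketch keeps every pair for clarity.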

“…It is based on the Synthetic Minority Over-Sampling Technique (SMOTE and SMOTUNED) for preparing the datasets and ensemble approaches for classifying the defective and correct code. In [22], the authors takes into account the proportion of the correct and defective code in each project in the dataset. To balance the classes, they duplicate the elements of the smaller class.…”

Section: Lack Of Data (mentioning)

confidence: 99%
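
The balancing step described in the statement above (duplicating elements of the smaller class) can be sketched as plain random oversampling. This is a hedged illustration only: the function name `oversample_by_duplication`, the seeded random choice of which elements to duplicate, and the toy data are assumptions, not details taken from the cited paper.

```python
import random
from collections import Counter


def oversample_by_duplication(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        # Pool of original samples belonging to this class.
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


# Toy example: four "clean" files (label 0) and one "defective" file (label 1).
files = ["A.java", "B.java", "C.java", "D.java", "E.java"]
flags = [0, 0, 0, 0, 1]
balanced_files, balanced_flags = oversample_by_duplication(files, flags)
print(Counter(balanced_flags))
```

Duplication should be applied to the training split only; oversampling before splitting would leak copies of the same sample into the test set.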


A Survey on Software Defect Prediction Using Deep Learning

et al. 2021


Defect prediction is one of the key challenges in software development and programming language research for improving software quality and reliability. The problem in this area is to properly identify the defective source code with high accuracy. Developing a fault prediction model is a challenging problem, and many approaches have been proposed throughout history. The recent breakthrough in machine learning technologies, especially the development of deep learning techniques, has led to many problems being solved by these methods. Our survey focuses on the deep learning techniques for defect prediction. We analyse the recent works on the topic, study the methods for automatic learning of the semantic and structural features from the code, discuss the open problems and present the recent trends in the field.


“…The approach improved the baselines by 3.00, 17.54, 8.77, 14.76 and 8.97%, respectively, on average AUC. Shi et al (2020) built their work based on code2vec, they proposed the PathPair2Vec framework based on Attention Mechanism. The different parts of the terminal node were encoded.…”

Section: E Framework (mentioning)

confidence: 99%

A Systematic Literature Review of Software Defect Prediction Using Deep Learning

Fathy, Abd-Elmegid, Bahaa et al. 2021

Journal of Computer Science


Software defect prediction approaches are used to reduce the time and cost of discovering defects in source code and to improve software quality in organizations. There are two approaches to revealing software defects in source code. The first concentrates on traditional features such as lines of code and code complexity. However, these features fail to capture the semantics of the source code. The second concentrates on revealing these semantics. This paper presents a Systematic Literature Review (SLR) of software defect prediction using deep learning models. The SLR focuses on identifying the studies that use the semantics of the source code to improve defect prediction. It aims to analyze the datasets, models and frameworks used, and to identify the evaluation metrics to ensure their applicability in software defect prediction. The IEEE Xplore, Scopus and Web of Science digital libraries were used to select suitable primary studies. Forty (40) primary studies published by 15 December 2020 were selected for analysis based on the quality criteria. The project levels applied in the studies were within-project (52.5%), cross-project (17.5%), and both within-project and cross-project (30%). The datasets used were the Promise dataset (68.18%) and other datasets (31.82%). The most used deep learning model in the primary studies was the Convolutional Neural Network (CNN), at 35%. The most used evaluation metrics were F-measure and Area Under the Curve (AUC). Software defect prediction using deep learning models is still a valuable topic and requires further research to enhance the performance of defect prediction.


“…To this end, we extend the code2vec model [20] to include paths extracted from various graph representations, mainly AST, CFG, and PDG. We choose code2vec as it is still used for various tasks and most of the recent approaches are built upon it [25], [26]. By extending the code2vec model with CFG and PDG, we can better capture the semantics of the code, which AST alone cannot leverage.…”

Section: Introduction (mentioning)

confidence: 99%

“…As a first step, we demonstrate that "combining syntactic and semantic paths shows an improvement of 11% over code2vec for the task of METHODNAMING". Moreover, many works are built upon code2vec, such as pathpair2vec [26] and code2seq [25], which outperform code2vec in various tasks. Thus, we believe considering a mocktail of the tree and graph-based structures (AST, CFG, and PDG) can lead to a new direction in representing source code while also improving existing works that rely either solely on ASTs or CFGs and PDGs.…”

Section: Introduction (mentioning)

confidence: 99%

A Mocktail of Source Code Representations

Swarna, Mathews, Vagavolu et al. 2021

Preprint


Efficient representation of source code is essential for various software engineering tasks such as code search and code clone detection. One such technique for representing source code involves extracting paths from the AST and using a learning model to capture program properties. Code2vec is a commonly used path-based approach that uses an attention-based neural network to learn code embeddings which can then be used for various software engineering tasks. However, this approach uses only ASTs and does not leverage other graph structures such as Control Flow Graphs (CFG) and Program Dependency Graphs (PDG). Similarly, most recent approaches for representing source code still use AST and do not leverage semantic graph structures. Even though there exists an integrated graph approach (Code Property Graph) for representing source code, it has only been explored in the domain of software security. Moreover, it does not leverage the paths from the individual graphs. In our work, we extend the path-based approach code2vec to include semantic graphs, CFG, and PDG, along with AST, which is still largely unexplored in the domain of software engineering. We evaluate our approach on the task of METHODNAMING using a custom C dataset of 730K methods collected from 16 C projects from GitHub. In comparison to code2vec, our approach improves the F1 Score by 11% on the full dataset and up to 100% with individual projects. We show that semantic features from the CFG and PDG paths are indeed helpful. We envision that looking at a mocktail of source code representations for various software engineering tasks can lay the foundation for a new line of research and a re-haul of existing research.
