Cross-Language Clone Detection by Learning Over Abstract Syntax Trees

Pérez, Daniel; Chiba, Shigeru

doi:10.1109/msr.2019.00078

Cited by 44 publications

(38 citation statements)

References 19 publications


“…4. AST-based cross-language clone detection was proposed by Perez (2019) [24]. The approach is a semi-supervised machine learning model which is capable of detecting cross-language clones by employing a token level vector generation algorithm and tree-based skip-gram algorithm.…”

Section: Related Work (mentioning)

confidence: 99%

Detection and Classification of Cross-language Code Clone Types by Filtering the Nodes of ANTLR-generated Parse Tree

Ankali, Parthiban

2021

IJISA


A complete and accurate cross-language clone detection tool can support software forking process that reuses the more reliable algorithms of legacy systems from one language code base to other. Cross-language clone detection also helps in building code recommendation system. This paper proposes a new technique to detect and classify cross-language clones of C and C++ programs by filtering the nodes of ANTLR-generated parse tree using a common grammar file, CPP14.g4. Parsing the input files using CPP14.g4 provides all the lexical and semantic information of input source code. Selective filtering of nodes performs serialization of two parse trees. Vector representation using term frequency inverse document frequency (TF-IDF) of the resultant tree is given as an input to cosine similarity to classify the clone types. Filtered parse tree of C and C++ increases the precision from 51% to 61%, and matching based on renaming the input/output expressions provides average precision of 91.97% and 95.37% for small scale and large scale repositories respectively. The proposed cross-language clone detection exhibits the highest precision of 95.37% in finding all types of clones (1, 2, 3 and 4) for 16,032 semantically similar clone pairs of C and CPP codes.
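The TF-IDF and cosine-similarity step described in this abstract can be illustrated with a minimal sketch. It assumes each filtered parse tree has already been serialized into a list of node labels; the label lists, the function names, and the smoothed IDF formula below are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption) keeps terms shared by all documents non-zero.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical serialized node labels for a C fragment, a C++ fragment,
# and an unrelated fragment.
c_nodes = ["functionDefinition", "declarator", "iterationStatement",
           "additiveExpression", "jumpStatement"]
cpp_nodes = ["functionDefinition", "declarator", "iterationStatement",
             "additiveExpression", "jumpStatement"]
other_nodes = ["functionDefinition", "selectionStatement", "jumpStatement"]

vecs = tfidf_vectors([c_nodes, cpp_nodes, other_nodes])
clone_score = cosine(vecs[0], vecs[1])
other_score = cosine(vecs[0], vecs[2])
```

In this sketch the two identical label sequences score close to 1, while the unrelated fragment scores lower. A real detector would still need a threshold for each clone type, which the abstract does not specify.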



“…Peng et al [14] proposed a novel "coding criterion" to build vector representations of nodes in ASTs, which have provided great progress in program analysis. BIGCODE [15] is a tool that can learn AST representations of given source codes with the help of the Skip-gram model [16].…”

Section: Program Vector Embeddings (mentioning)

confidence: 99%

“…In this step, each node in ASTs is trained and map to a real-valued vector, which contains each feature of the node. Inspired by BIGCODE tools [15], the Skip-gram model [16] is used to compute node vectors. The principle of this model is to use the currently known nodes to predict the context of them.…”

Section: Program Vector Embeddings (mentioning)

confidence: 99%
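The idea quoted above, using a known node to predict its context, can be sketched as the generation of skip-gram training pairs. The sketch uses Python's built-in `ast` module and assumes that a node's context is its direct parent and children; that context definition and the function name are assumptions for illustration. It only produces (target, context) pairs and does not train embeddings.

```python
import ast


def skipgram_pairs(source):
    """Return (target, context) node-type pairs from a Python AST.

    Context is assumed to be the direct parent and the direct children
    of each node; other window definitions are possible.
    """
    tree = ast.parse(source)
    pairs = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            parent_type = type(parent).__name__
            child_type = type(child).__name__
            pairs.append((parent_type, child_type))  # parent predicts child
            pairs.append((child_type, parent_type))  # child predicts parent
    return pairs


pairs = skipgram_pairs("def add(a, b):\n    return a + b\n")
```

Each pair would then serve as one training example for a skip-gram model that maps node types to real-valued vectors.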

HELP-DKT: An Interpretable Cognitive Model of How Students Learn Programming Based on Deep Knowledge Tracing

Liang, Peng, et al. 2021

Preprint


Student cognitive models are playing an essential role in intelligent online tutoring for programming courses. These models capture students' learning interactions and store them in the form of a set of binary responses, thereby failing to utilize rich educational information in the learning process. Moreover, the recent development of these models has been focused on improving the prediction performance and tended to adopt deep neural networks in building the end-to-end prediction frameworks. Although this approach can provide an improved prediction performance, it may also cause difficulties in interpreting the student's learning status, which is crucial for providing personalized educational feedback. To address this problem, this paper provides an interpretable cognitive model named HELP-DKT, which can infer how students learn programming based on deep knowledge tracing. HELP-DKT has two major advantages. First, it implements a feature-rich input layer, where the raw codes of students are encoded to vector representations, and the error classifications as concept indicators are incorporated. Second, it can infer meaningful estimation of student abilities while reliably predicting future performance. The experiments confirm that HELP-DKT can achieve good prediction performance and present reasonable interpretability of student skills improvement. In practice, HELP-DKT can personalize the learning experience of novice learners.


“…The main idea of these approaches is to convert the source code written in different languages into common tree structures, such as eCST (enriched concrete syntax tree) [5], AST [27,28], and CodeDOM (Code Document Object Model) [29]. Then, the tree structures are converted into token sequences or vectors to improve the efficiency of similarity measure.…”

Section: Cross-language Source Code Similarity Detection Through Tree-based Intermediate Representation (mentioning)

confidence: 99%

“…These approaches also ignore the structural features of the source code. Although the approach proposed in [28] combines the AST and LSTM to detect the similarity between Java and Python code, they are greatly affected by some complex obfuscation technologies, e.g., the commonly used adding redundant statements. Meanwhile, this kind of approach needs to train their models with a lot of code rather than detecting the code similarity directly.…”

Section: Cross-language Source Code Similarity Detection (mentioning)

confidence: 99%

Flowchart-Based Cross-Language Source Code Similarity Detection

Feng, Liu, et al. 2020

Scientific Programming


Source code similarity detection has various applications in code plagiarism detection and software intellectual property protection. In computer programming teaching, students may convert the source code written in one programming language into another language for their code assignment submission. Existing similarity measures of source code written in the same language are not applicable for the cross-language code similarity detection because of syntactic differences among different programming languages. Meanwhile, existing cross-language source similarity detection approaches are susceptible to complex code obfuscation techniques, such as replacing equivalent control structure and adding redundant statements. To solve this problem, we propose a cross-language code similarity detection (CLCSD) approach based on code flowcharts. In general, two source code fragments written in different programming languages are transformed into standardized code flowcharts (SCFC), and their similarity is obtained by measuring their corresponding SCFC. More specifically, we first introduce the standardized code flowchart (SCFC) model to be the uniform flowcharts representation of source code written in different languages. SCFC is language-independent, and therefore, it can be used as the intermediate structure for source code similarity detection. Meanwhile, transformation techniques are given to transform source code written in a specific programming language into an SCFC. Second, we propose the SCFC-SPGK algorithm based on the shortest path graph kernel to measure the similarity between two SCFCs. Thus, the similarity between two pieces of source code in different programming languages is given by the similarity between SCFCs. Experimental results show that compared with existing approaches, CLCSD has higher accuracy in cross-language source code similarity detection. 
Furthermore, CLCSD can not only handle common source code obfuscation techniques used by students in computer programming teaching but also obtain nearly 90% accuracy in dealing with some complex obfuscation techniques.

