Structural and Nominal Cross-Language Clone Detection

Nichols, Lawton; Emre, Mehmet; Hardekopf, Ben

doi:10.1007/978-3-030-16722-6_14

Cited by 13 publications

(10 citation statements)

References 17 publications


“…Prior work has shown that identifiers impact source code comprehension, especially for beginners [14], and as developers must understand the code returned by search, tokens are an important consideration. Prior work in code-to-code search that relies on ASTs have seen high precision and recall [43,63] suggesting that is an important consideration as well. Individually, each measure has shortcomings.…”

Section: Motivation (mentioning)

confidence: 99%

“…Techniques that use static code attributes to compute similarity often parse code into an intermediate representation based on text [7,36,47], AST [11,34] or graph-based [26,46] and compute a measure for syntactic similarity. For cross-language syntactic similarity, most techniques are text-based [43,56,58]. Tree- and graph-based approaches have not been explored for cross-language similarity due to language specific grammar.…”

Section: Code Similarity (mentioning)

confidence: 99%


Cross-language code search using static and dynamic analyses

Mathew

Stolee

2021

Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Softw


As code search permeates most activities in software development, code-to-code search has emerged to support using code as a query and retrieving similar code in the search results. Applications include duplicate code detection for refactoring, patch identification for program repair, and language translation. Existing code-to-code search tools rely on static similarity approaches such as the comparison of tokens and abstract syntax trees (AST) to approximate dynamic behavior, leading to low precision. Most tools do not support cross-language code-to-code search, and those that do rely on machine learning models that require labeled training data. We present Code-to-Code Search Across Languages (COSAL), a cross-language technique that uses both static and dynamic analyses to identify similar code and does not require a machine learning model. Code snippets are ranked using non-dominated sorting based on code token similarity, structural similarity, and behavioral similarity. We empirically evaluate COSAL on two datasets of 43,146 Java and Python files and 55,499 Java files and find that 1) code search based on non-dominated ranking of static and dynamic similarity measures is more effective compared to single or weighted measures; and 2) COSAL has better precision and recall compared to state-of-the-art within-language and cross-language code-to-code search tools. We explore the potential for using COSAL on large open-source repositories and discuss scalability to more languages and similarity metrics, providing a gateway for practical, multi-language code-to-code search.

CCS Concepts: Software and its engineering → Software maintenance tools; Information systems → Similarity measures.
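The non-dominated ranking mentioned in the abstract above can be illustrated with a minimal sketch. This is not COSAL's implementation: the function names, the file names, and the three similarity scores per candidate are all hypothetical, and each score is simply assumed to lie in [0, 1] with higher meaning more similar. A candidate is placed in an earlier front when no other remaining candidate is at least as good on every measure and strictly better on one.

```python
from typing import Dict, List, Tuple

# Hypothetical score triple: (token, structural, behavioral) similarity.
# Higher is better on every measure; the values below are invented.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_fronts(candidates: Dict[str, Scores]) -> List[List[str]]:
    """Peel off successive Pareto fronts; front 0 holds the best-ranked results."""
    remaining = dict(candidates)
    fronts: List[List[str]] = []
    while remaining:
        # A candidate belongs to the current front if nothing left dominates it.
        front = sorted(
            name
            for name, score in remaining.items()
            if not any(
                dominates(other_score, score)
                for other, other_score in remaining.items()
                if other != name
            )
        )
        fronts.append(front)
        for name in front:
            del remaining[name]
    return fronts


# Invented search results for a single query snippet.
candidates = {
    "a.py": (0.9, 0.8, 1.0),
    "b.java": (0.7, 0.9, 1.0),
    "c.py": (0.6, 0.7, 0.5),
    "d.java": (0.2, 0.1, 0.0),
}
fronts = non_dominated_fronts(candidates)
print(fronts)
```

Here `a.py` and `b.java` share the first front because each beats the other on one measure, which is the case a single weighted score would have to break arbitrarily.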


“…Several recent studies have reported on cross-language code clone detection [58,88,59]. For example, LICCA, a tool for cross-language clone detection [82] is based on a tree-based intermediate representation of the source code.…”

Section: Extensibility Of Experiments Testbed For Software Engineering Experiments (mentioning)

confidence: 99%

What is Your Code Clone Detection and Evolution Research Made Of?

Wijesiriwardana

Wimalaratne

2021

cai


Over the past few decades, clone detection and evolution have become a major area of study in software engineering. Clone detection experiments present several challenges to researchers such as accurate data collection, selecting proper code detection algorithms, and understanding clone evolution phenomena. This paper attempts to facilitate clone detection and evolution research by providing a structured and systematic mechanism to conduct experiments. Clone detection experiments usually consist of several tasks such as fetching data from a version control system, performing necessary pre-processing activities, and feeding the data to a clone detection algorithm. Therefore, a particular clone detection experiment can be interpreted as a meaningful combination of such tasks into a scientific workflow. In this work, the concrete tasks in a code clone detection workflow are referred to as Building Blocks. This paper presents a useful collection of Building Blocks identified based on a systematic literature review, and a conceptual framework of an experimental testbed to facilitate clone detection experiments. The reusability of the Building Blocks was validated using four case studies selected from the literature. The validation results confirm the reusability and the expressiveness of the Building Blocks in new ventures. Besides, the proposed experimental testbed is proven beneficial in conducting and replicating clone detection experiments.


“…The main idea of these approaches is to convert the source code written in different languages into common tree structures, such as eCST (enriched concrete syntax tree) [5], AST [27,28], and CodeDOM (Code Document Object Model) [29]. Then, the tree structures are converted into token sequences or vectors to improve the efficiency of similarity measure.…”

Section: Cross-language Source Code Similarity Detection Through Tree-based Intermediate Representation (mentioning)

confidence: 99%

Flowchart-Based Cross-Language Source Code Similarity Detection

Feng

Liu

et al. 2020

Scientific Programming


Source code similarity detection has various applications in code plagiarism detection and software intellectual property protection. In computer programming teaching, students may convert the source code written in one programming language into another language for their code assignment submission. Existing similarity measures of source code written in the same language are not applicable for the cross-language code similarity detection because of syntactic differences among different programming languages. Meanwhile, existing cross-language source similarity detection approaches are susceptible to complex code obfuscation techniques, such as replacing equivalent control structure and adding redundant statements. To solve this problem, we propose a cross-language code similarity detection (CLCSD) approach based on code flowcharts. In general, two source code fragments written in different programming languages are transformed into standardized code flowcharts (SCFC), and their similarity is obtained by measuring their corresponding SCFC. More specifically, we first introduce the standardized code flowchart (SCFC) model to be the uniform flowcharts representation of source code written in different languages. SCFC is language-independent, and therefore, it can be used as the intermediate structure for source code similarity detection. Meanwhile, transformation techniques are given to transform source code written in a specific programming language into an SCFC. Second, we propose the SCFC-SPGK algorithm based on the shortest path graph kernel to measure the similarity between two SCFCs. Thus, the similarity between two pieces of source code in different programming languages is given by the similarity between SCFCs. Experimental results show that compared with existing approaches, CLCSD has higher accuracy in cross-language source code similarity detection. Furthermore, CLCSD can not only handle common source code obfuscation techniques used by students in computer programming teaching but also obtain nearly 90% accuracy in dealing with some complex obfuscation techniques.
