Large‐scale inter‐system clone detection using suffix trees and hashing

Program source code is one of the main targets of software engineering research. A wide variety of research has been conducted on source code, and many studies have leveraged structural, vocabulary, and method signature similarities to measure the functional sameness of source code. In this research, we conducted an empirical study to ascertain how these three similarities should be used to measure functional sameness. We used two large datasets and measured the three similarities between all the method pairs in the datasets, each of which included approximately 15 million Java method pairs. The relationships between the three similarities were analyzed to determine how each should be used to detect functionally similar code. The results of our study revealed the following. (1) Method names are not always useful for detecting functionally similar code. Only when a small number of methods share a given name are those methods likely to include functionally similar code. (2) Existing file-level, method-level, and block-level clone detection techniques often miss functionally similar code generated by copy-and-paste operations between different projects. (3) When we used structural similarity alone to detect functionally similar code, we obtained many false positives. However, most of these false positives can be avoided by using a vocabulary similarity in addition to a structural one. (4) Using a vocabulary similarity to detect functionally similar code is not suitable for method pairs in the same file because such method pairs use many of the same program elements such as private methods or private fields.

“…They can be classified into two categories, fine-grained detection [17,23,27,39,41] and unit-level detection [19,35,40].…”

Section: Identifying Code Reuse Between Different Projects (mentioning)

confidence: 99%

How should we measure functional sameness from program source code? an exploratory study on Java methods

Higo

Kusumoto

2014

Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering

“…Dedicated code search engines such as BlackDuck OpenHub (BlackDuck, 2016), Krugle (Aragon Consulting Group, Inc., 2018) or Searchcode (Boyter, Ben, 2018) cannot efficiently handle code clones with modifications. Hummel et al (2010) and Koschke (2014) are among the first to propose scalable clone detection systems. However, the trade-off for the scalability is their ability to report only copy-and-paste clones or clones with variable renaming (i.e., Type-1 and Type-2 clones), while the largest number of clones found in software are clones with added or deleted statements (i.e., Type-3 clones).…”

Section: Background and Motivation (mentioning)

confidence: 99%

“…Thus, adding new projects to the code base under analysis or updating existing projects would result in the need to rerun the clone detection for the complete data set. Several of the proposed techniques that support incremental clone detection do not scale to large-scale data sets (Göde and Koschke, 2009;Kawaguchi et al, 2009;Nguyen et al, 2009) or do not detect Type-3 clones in sacrificing for scalability (Hummel et al, 2010;Koschke, 2014).…”

Section: Background and Motivation (mentioning)

confidence: 99%

“…It offers a short query response time and is scalable to large-scale data sets (Manning et al, 2009). It has been used in scalable code clone detection tools and techniques, such as Hummel et al (2010), Koschke (2014), Sajnani et al (2016), , and Saini et al (2018), and has shown to offer high scalability on large code corpora.…”

Section: Background and Motivation (mentioning)

confidence: 99%

Siamese: scalable and incremental code clone search via multiple code representations

Ragkhitwetsagul

Krinke

2019

Empir Software Eng

This paper presents a novel code clone search technique that is accurate, incremental, and scalable to hundreds of millions of lines of code. Our technique incorporates multiple code representations (i.e., a technique to transform code into various representations to capture different types of clones), query reduction (i.e., a technique to select clone search keywords based on their uniqueness), and a customised ranking function (i.e., a technique to allow a specific clone type to be ranked on top of the search results) to improve clone search performance. We implemented the technique in a clone search tool, called Siamese, and evaluated its search accuracy and scalability on three established clone data sets. Siamese offers the highest mean average precision of 95% and 99% on two clone benchmarks compared to seven state-of-the-art clone detection tools, and reported the largest number of Type-3 clones compared to three other code search engines. Siamese is scalable and can return cloned code snippets within 8 seconds for a code corpus of 365 million lines of code. Using an index of 130,719 GitHub projects, we demonstrate that Siamese's incremental indexing capability dramatically decreases the index preparation time for large-scale data sets with multiple releases of software projects. The paper discusses the applications of Siamese to facilitate software development and research with two use cases: online code clone detection and clone search with automated license analysis.

“…After a hash value is computed for each input source file, code clones are retrieved from the databases where the hash values are stored. Koschke proposed code clone detection approach using suffix tree and MD5 hash function [20]. The goal of his research is to detect code clones between a subject systems and a set of other systems for finding potential license violations.…”

Section: Related Work (mentioning)

confidence: 99%

Proposing and Evaluating Clone Detection Approaches with Preprocessing Input Source Files

Choi

Yoshida

Higo

et al. 2015

IEICE Trans. Inf. & Syst.

SUMMARY So far, many approaches for detecting code clones have been proposed based on different degrees of normalization (e.g. removal of white spaces, tokenization, and regularization of identifiers). Different degrees of normalization lead to different granularities of source code being detected as code clones. To investigate how normalization impacts code clone detection, this study proposes six approaches for detecting code clones that preprocess the input source files using different degrees of normalization. More precisely, in the preprocessing step each normalization is applied to the input source files and the files are then partitioned into equivalence classes. After that, code clones are detected from the set of files that represent each equivalence class, using a token-based code clone detection tool named CCFinder. The proposed approaches can be categorized into two types: approaches without normalization and approaches with normalization. The former detect only identical files, without any normalization. The latter detect identical files after different degrees of normalization, such as removal of all lines containing macros. From the case study, we observed that our proposed approaches detect code clones faster than the approach that uses only CCFinder. We also found that the approach without normalization is the fastest among the proposed approaches in many cases.
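The preprocessing step described in this abstract (normalize each file, partition files into equivalence classes, keep one representative per class) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `normalize` rule (dropping lines that start with `#` and removing all whitespace), the use of SHA-256 to compare normalized contents, and the function names `partition` and `representatives` are all assumptions introduced here. The representatives would then be passed to a clone detector such as CCFinder, which is not shown.

```python
import hashlib
import re
from collections import defaultdict
from typing import Callable, Dict, List


def normalize(source: str) -> str:
    """Hypothetical normalization: drop lines starting with '#' (macro-like
    lines) and remove all whitespace. The paper's six variants are not
    reproduced here."""
    kept = [ln for ln in source.splitlines() if not ln.lstrip().startswith("#")]
    return re.sub(r"\s+", "", "".join(kept))


def partition(files: Dict[str, str],
              normalizer: Callable[[str], str] = normalize) -> List[List[str]]:
    """Group file paths whose normalized contents are identical.

    Contents are compared through a SHA-256 digest (an assumption; any
    collision-resistant hash or a direct string comparison would also work).
    """
    classes = defaultdict(list)
    for path in sorted(files):
        digest = hashlib.sha256(normalizer(files[path]).encode("utf-8")).hexdigest()
        classes[digest].append(path)
    return sorted(classes.values())


def representatives(classes: List[List[str]]) -> List[str]:
    """Pick the first path of each equivalence class as its representative."""
    return [members[0] for members in classes]


if __name__ == "__main__":
    demo = {
        "a.c": "#define N 1\nint x = 1;\n",
        "b.c": "int  x=1;",
        "c.c": "int y = 2;\n",
    }
    groups = partition(demo)
    print(groups)
    print(representatives(groups))
```

Passing `normalizer=lambda s: s` corresponds to the variant without normalization, where only byte-identical files fall into the same class.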

Cited by 19 publications

References 35 publications
