Efficient plagiarism detection for large code repositories

Burrows, Steven; Tahaghoghi, S. M. M.; Zobel, Justin

doi:10.1002/spe.750

Cited by 84 publications

(84 citation statements)

References 14 publications

“…It has been defined as one of the code cloning patterns by (Kapser 2006;Kapser and Godfrey 2008). Boiler-plate code can be found when building device drivers for operating systems (Baxter et al 1998), developing android applications , and giving programming assignments (Burrows et al 2007;Schleimer et al 2003). Boiler-plate code usually contains small code modifications in order to adapt the boiler-plate code to a new environment.…”

Section: Source Code Modifications (mentioning)

confidence: 99%

“…Different similarity measurements such as suffix trees, string alignment, Jaccard similarity, etc., can be applied to sequences or sets of tokens. Tools that rely on tokens include Sherlock (Joy and Luck 1999), BOSS (Joy et al 2005), Sim (Gitchell and Tran 1999), YAP3 (Wise 1996), JPlag (Prechelt et al 2002), CCFinder (Kamiya et al 2002), CP-Miner (Li et al 2006), MOSS (Schleimer et al 2003), Burrows et al (2007), and the Source Code Similarity Detector System (SCSDS) (Duric and Gasevic 2013). The token-based representation is widely used in source code similarity measurement and very efficient on a scale of millions SLOC.…”

Section: Code Similarity Measurement (mentioning)

confidence: 99%

A comparison of code similarity analysers

2017

Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is and it may be modified for various purposes; e.g. refactoring, bug fixing, or even software plagiarism. These code modifications could affect the performance of code similarity analysers including code clone and plagiarism detectors to some degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranked-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set.
The code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.
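
The citation statement above mentions that measures such as Jaccard similarity can be applied to sets of tokens. The following Python sketch illustrates that general idea only; it is not the method of any tool named above. The tokenizer, the small keyword set, the identifier placeholder `ID`, and the k-gram length are all assumptions made for illustration.

```python
import re

# Assumed, deliberately tiny keyword set; a real tool would use the full
# keyword list of the target language.
KEYWORDS = {"int", "return", "if", "else", "for", "while"}

def tokenize(source):
    """Split source text into identifiers, integer literals, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def normalise(tokens):
    """Replace every non-keyword identifier with the placeholder 'ID'."""
    return [
        "ID" if re.match(r"[A-Za-z_]", t) and t not in KEYWORDS else t
        for t in tokens
    ]

def kgrams(tokens, k=3):
    """Return the set of all k-token windows over the token sequence."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity(src_a, src_b, k=3):
    """Jaccard similarity of the normalised token k-gram sets of two sources."""
    grams_a = kgrams(normalise(tokenize(src_a)), k)
    grams_b = kgrams(normalise(tokenize(src_b)), k)
    return jaccard(grams_a, grams_b)
```

Because identifiers are replaced by a placeholder before comparison, two fragments that differ only in identifier names receive the same score as identical fragments in this sketch, which is the usual motivation for comparing tokens rather than raw text.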

“…These approaches are text-based, attribute-based, and structure-based approach [4,5]. Text-based approach is the only approach which is programming-independent since it treats source code as raw text.…”

Section: Related Work (mentioning)

confidence: 99%

“…In general, there are three major approaches for detecting source code plagiarism: text-based, attribute-based, and structure-based approach [4,5]. Text-based approach determines similarity by considering source code as a raw text; attribute-based approach determines similarity based on source code attributes (e.g.…”

Section: Introduction (mentioning)

confidence: 99%

Detecting Source Code Plagiarism on .NET Programming Languages using Low-level Representation and Adaptive Local Alignment

Rabbani; Karnalim

2017

J. inf. organ. sci. (Online)

Even though there are various source code plagiarism detection approaches, only a few works focus on low-level representation for deducing similarity. Most focus only on the lexical token sequence extracted from source code. In our view, a low-level representation is more beneficial than lexical tokens since its form is more compact than the source code itself. It considers only semantic-preserving instructions and ignores many source code delimiter tokens. This paper proposes a source code plagiarism detection approach that relies on low-level representation. As a case study, we focus on .NET programming languages, with Common Intermediate Language as the low-level representation. In addition, we incorporate Adaptive Local Alignment for detecting similarity. According to Lim et al, this algorithm outperforms the state-of-the-art code similarity algorithm (i.e. Greedy String Tiling) in terms of effectiveness. According to our evaluation, which involves various plagiarism attacks, our approach is more effective and efficient than the standard lexical-token approach.
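
To make the idea of aligning instruction sequences concrete, here is a minimal Python sketch of plain Smith-Waterman local alignment scoring. This is a swapped-in, simpler technique: it is not the Adaptive Local Alignment variant the abstract refers to, and the scoring values (`match`, `mismatch`, `gap`) and the function name are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best Smith-Waterman local alignment score of sequences a and b.

    The scoring parameters are illustrative assumptions, not values taken
    from the paper described above.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + step,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # skip an element of a
                curr[j - 1] + gap,   # skip an element of b
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

As a usage example, comparing the hypothetical instruction lists `["ldarg.0", "ldarg.1", "add", "ret"]` and `["nop", "ldarg.0", "ldarg.1", "add", "ret"]` yields a score of 8 with the default parameters: the four shared instructions align at 2 points each, and the leading `nop` costs nothing because a local alignment may start anywhere.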

“…However, it is common that unethical students plagiarize other people's work. For example, an Australian survey conducted in 2002 reported that 85% of a class at Monash University, and 70% of a class at Swinburne University, engaged in plagiarism during their studies [3].…”

Section: Introduction (mentioning)

confidence: 99%