A comparison of code similarity analysers

Ragkhitwetsagul, Chaiyong; Krinke, Jens; Clark, David

doi:10.1007/s10664-017-9564-7

Cited by 97 publications

(58 citation statements)

References 90 publications

(132 reference statements)


“…It is true that treating source code as text enables the detection of cross-language plagiarism and collusion with minimal effort. Nevertheless, this treatment may reduce the detection accuracy [44]; the source code can be inaccurately parsed since source code grammars are different from text grammars. For instance, statement countMAX+=1; can be considered to be one word according to text grammars since no spaces are involved between the tokens.…”

Section: Related Work (mentioning)

confidence: 99%
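The tokenisation point in the statement above can be illustrated with a minimal sketch. It is not taken from the cited paper: the regular expression below is a hypothetical, heavily simplified lexer pattern, and the function names `text_tokens` and `code_tokens` are assumptions made for this illustration.

```python
import re

# Hypothetical, simplified lexer pattern (an assumption for illustration):
# identifiers, integer literals, a few two-character operators,
# then any other single non-space character.
TOKEN_PATTERN = re.compile(
    r"[A-Za-z_]\w*|\d+|[+\-*/%=<>!&|]=|\+\+|--|&&|\|\||\S"
)

def text_tokens(source):
    """Split on whitespace only, as a plain-text tokeniser would."""
    return source.split()

def code_tokens(source):
    """Split into lexical tokens using the simplified pattern above."""
    return TOKEN_PATTERN.findall(source)

statement = "countMAX+=1;"
print(text_tokens(statement))  # ['countMAX+=1;']
print(code_tokens(statement))  # ['countMAX', '+=', '1', ';']
```

With whitespace splitting, the whole statement is a single "word", so renaming `countMAX` would change the only token the detector sees. A language-aware tokeniser still matches `+=`, `1` and `;`.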

“…Some detection techniques have addressed the issue by considering the source code as raw text [7,50], removing any needs for language-specific components. Even though this kind of approach is applicable, it may lack effectiveness [43,44]. Occasionally, a given source code can be inaccurately tokenized since text grammars are different from source code grammars.…”

Section: Introduction (mentioning)

confidence: 99%


TF-IDF Inspired Detection for Cross-Language Source Code Plagiarism and Collusion

Karnalim

2020

csci


Several computing courses allow students to choose which programming language they want to use for completing a programming task. This can lead to cross-language code plagiarism and collusion, in which the copied code file is rewritten in another programming language. In response to that, this paper proposes a detection technique which is able to accurately compare code files written in various programming languages, but with limited effort in accommodating such languages at development stage. The only language-dependent feature used in the technique is source code tokeniser and no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF inspired weighting, in which rare matches are prioritised. Our evaluation shows that the technique outperforms common techniques in academia for handling language conversion disguises. Further, it is comparable to those techniques when dealing with conventional disguises.
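The idea of prioritising rare matches can be sketched as follows. This is an illustration under stated assumptions, not the formula from the paper: token bigrams as the matching unit, a smoothed weight of `log(N / df) + 1`, and a weighted Jaccard-style ratio are all choices made here, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n=2):
    """Return the set of token n-grams in one file."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def idf_weights(docs):
    """Weight each n-gram by log(N / df) + 1 (assumed smoothing).

    docs is a list of n-gram sets, one per file. N-grams found in
    many files get a low weight; rare ones get a high weight.
    """
    n_docs = len(docs)
    df = Counter(gram for doc in docs for gram in doc)
    return {gram: math.log(n_docs / count) + 1.0 for gram, count in df.items()}

def weighted_similarity(a, b, weights):
    """Weighted overlap of two n-gram sets, divided by their weighted union."""
    shared = sum(weights[g] for g in a & b)
    total = sum(weights[g] for g in a | b)
    return shared / total if total else 0.0

# Toy token streams (already tokenised; contents are made up).
files = [
    "for ( i ) swap ( a , b )".split(),
    "for ( j ) swap ( a , b )".split(),
    "for ( k ) print ( x )".split(),
]
docs = [ngrams(f) for f in files]
weights = idf_weights(docs)
print(weighted_similarity(docs[0], docs[1], weights))
print(weighted_similarity(docs[0], docs[2], weights))
```

In this toy corpus the bigram `('for', '(')` occurs in every file and so contributes the minimum weight, while the `swap` bigrams shared by the first two files count for more.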


“…The first data set, called the generated data set, is used in our previous study of comparing 30 code similarity analysers [19]. It contains 100 Java source code files with pervasive code modifications.…”

Section: E Data Sets (mentioning)

confidence: 99%

“…It preprocesses source code before detecting clones by using pretty-printing, variable renaming, and code abstraction. We chose NiCad because it has been used in several clone studies [19], [26], [22] and it reports clones at method-level, similar to Vincent. Both Vincent and NiCad were configured with the default configurations.…”

Section: G Experimental Design (mentioning)

confidence: 99%
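The variable-renaming step mentioned in the statement above can be illustrated with a small sketch. It is not NiCad's implementation: the keyword set is a deliberately tiny, hypothetical subset of Java, string literals and comments are ignored, and `rename_identifiers` is a name invented for this example.

```python
import re

# Deliberately tiny, hypothetical keyword subset (an assumption).
KEYWORDS = {"int", "return", "if", "else", "for", "while"}
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*")

def rename_identifiers(code):
    """Replace each distinct identifier with id0, id1, ... in order of first use."""
    mapping = {}

    def replace(match):
        name = match.group(0)
        if name in KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"id{len(mapping)}"
        return mapping[name]

    return IDENTIFIER.sub(replace, code)

a = "int total = price + tax; return total;"
b = "int sum = cost + fee; return sum;"
print(rename_identifiers(a))  # int id0 = id1 + id2; return id0;
print(rename_identifiers(a) == rename_identifiers(b))  # True
```

After this normalisation, fragments that differ only in identifier names become textually identical, so a line-based comparison can match them.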


A picture is worth a thousand words: Code clone detection based on image similarity

Ragkhitwetsagul

Krinke

Marnette

2018

2018 IEEE 12th International Workshop on Software Clones (IWSC)

Self Cite


This paper introduces a new code clone detection technique based on image similarity. The technique captures visual perception of code seen by humans in an IDE by applying syntax highlighting and images conversion on raw source code text. We compared two similarity measures, Jaccard and earth mover's distance (EMD) for our image-based code clone detection technique. Jaccard similarity offered better detection performance than EMD. The F1 score of our technique on detecting Java clones with pervasive code modifications is comparable to five well-known code clone detectors: CCFinderX, Deckard, iClones, NiCad, and Simian. A Gaussian blur filter is chosen as a normalisation technique for type-2 and type-3 clones. We found that blurring code images before similarity computation resulted in higher precision and recall. The detection performance after including the blur filter increased by 1 to 6 percent. The manual investigation of clone pairs in three software systems revealed that our technique, while it missed some of the true clones, could also detect additional true clone pairs missed by NiCad.
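As a rough illustration of the Jaccard measure named in the abstract, the sketch below compares two tiny binary bitmaps. How the paper maps rendered code images to sets is not stated here, so treating each image as the set of its non-blank pixel coordinates is an assumption, and `ink_pixels` is a hypothetical helper.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)

def ink_pixels(bitmap):
    """Return the (row, col) coordinates of non-zero pixels in a 2D list."""
    return {
        (r, c)
        for r, row in enumerate(bitmap)
        for c, value in enumerate(row)
        if value
    }

# Two made-up 3x4 bitmaps standing in for rendered code images.
image_a = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
image_b = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]
print(jaccard(ink_pixels(image_a), ink_pixels(image_b)))  # 0.6
```

The two bitmaps share three inked pixels out of five in the union, giving 3/5.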


Layered similarity detection for programming plagiarism and collusion on weekly assessments

Karnalim

Simón

Chivers

2022

Comp Applic In Engineering


When weekly programming assessments are used, it is often the case that some of them are either trivial or strongly directed. Common code similarity detectors are not particularly helpful with such assessments: some potential instances of misconduct are not selected for manual investigation as all submissions are expected to be similar and it is not feasible to check them all. Several dedicated similarity detectors have been developed to work with such assessments, but the experience is required to determine when to use them. This paper presents a similarity detector that works on many kinds of weekly assessments. It combines three‐layered types of similarity so that even within a set of highly similar submissions, program pairs are still sorted according to their levels of similarity. Our similarity detector is more effective than JPlag in distinguishing similar programs and helping to identify plagiarism and collusion. The similarity detector is slower than JPlag, but the longer execution time is partly offset by some optimization that has no negative impact on the effectiveness. As weekly assessments seldom entail large submissions, the execution time does not appear to be a barrier to use.
