WASTK: A Weighted Abstract Syntax Tree Kernel Method for Source Code Plagiarism Detection

2020

csci

Several computing courses allow students to choose which programming language they want to use for completing a programming task. This can lead to cross-language code plagiarism and collusion, in which the copied code file is rewritten in another programming language. In response to that, this paper proposes a detection technique which is able to accurately compare code files written in various programming languages, but with limited effort in accommodating such languages at development stage. The only language-dependent feature used in the technique is source code tokeniser and no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF inspired weighting, in which rare matches are prioritised. Our evaluation shows that the technique outperforms common techniques in academia for handling language conversion disguises. Further, it is comparable to those techniques when dealing with conventional disguises.

Section: Related Workmentioning

confidence: 99%

TF-IDF Inspired Detection for Cross-Language Source Code Plagiarism and Collusion

2020

csci

“…Source code token sequence [19][20][21] is a sequence of meaningful "words" from source code; it is usually extracted with the help of a programming language parser. Abstract syntax tree [22] is a tree where the nodes are formed from tokens and the edges are formed from programming language grammar. Program dependency graph [23] is a graph connecting several instructions based on their execution dependencies.…”

Section: Related Workmentioning

confidence: 99%

Automated Hints Generation for Investigating Source Code Plagiarism and Identifying The Culprits on In-Class Individual Programming Assessment

Budiman

2019

Computers

Most source code plagiarism detection tools only rely on source code similarity to indicate plagiarism. This can be an issue since not all source code pairs with high similarity are plagiarism. Moreover, the culprits (i.e., the ones who plagiarise) cannot be differentiated from the victims even though they need to be educated further on different ways. This paper proposes a mechanism to generate hints for investigating source code plagiarism and identifying the culprits on in-class individual programming assessment. The hints are collected from the culprits’ copying behaviour during the assessment. According to our evaluation, the hints from source code creation process and seating position are 76.88% and at least 80.87% accurate for indicating plagiarism. Further, the hints from source code creation process can be helpful for indicating the culprits as the culprits’ codes have at least one of our predefined conditions for the copying behaviour.

“…Structure-based tool defines source code similarity based on given codes' shared structure. The structure can be either source code token sequence [2], [15], [16], low-level token sequence [17], [18], program dependency graph [19] or abstract syntax tree [20]. The similarity of the first two structures are commonly measured by string matching algorithms (e.g., Running-Karp-Rabin Greedy-String-Tiling [5]), that have been modified to handle source code tokens instead of characters.…”

Section: Literature Reviewmentioning

confidence: 99%

A Language-Independent Library for Observing Source Code Plagiarism

Franclinton

2019

JISEBI

Background: Most source code plagiarism detection tools are not modifiable. Consequently, when a modification is required to be applied, a new detection tool should be created along with it. This could be a problem as creating the tool from scratch is time-inefficient while most of the features are similar across source code plagiarism detection tools.Objective: To alleviate researchers' effort, this paper proposes a library for observing two plagiarism-suspected codes (a feature which is similar across most source code plagiarism detection tools).Methods: Unique to this library, it is not constrained by the selected programming language for development. It is executed from command line, which is supported by most programming languages.Results: According to our evaluation, the library is integrable and functional. Moreover, the library can enhance teaching assistants' accuracy and reduce the tasks' completion time.Conclusion: The library can be beneficial for the development of source code plagiarism detection tools since it is integrable, functional, and helpful for teaching assistants.Keywords:Language independency, Plagiarism detection, Reusable library, Source code, Tool development