Source Code Plagiarism Detection in Academia with Information Retrieval: Dataset and the Observation

Simón

2021

IEEE Access

Self Cite

When using code similarity detection to uncover code plagiarism and collusion, the marker needs to determine whether any detected similarities might be the result of coincidence. But understanding the similarities can be difficult and might be prone to human error, because few tools facilitate the investigation process, and if they do, the similarities are not explicitly explained in human language. This paper presents STRANGE, an investigation module that exclusively explains code similarities in natural language (English and Indonesian). For the purpose of reusability, STRANGE can be embedded in JPlag and other code similarity detection tools. It can also act as a standalone tool for measuring source code similarity. Our evaluation shows that STRANGE is more helpful than JPlag in the investigation process since it explains the similarities in natural language. Further, its effectiveness is comparable to that of JPlag but higher on trivial disguises of the sort that novice students will tend to apply when disguising copied code.

Section: Addressing RQ2: Comparing STRANGE with JPlag in Recognizing Copied Programs (mentioning)

confidence: 99%

Explanation in Code Similarity Investigation

Simón

2021

IEEE Access

Self Cite

“…The introductory dataset was created by rewriting seven original code files in a Java dataset [28] adapted from Liang's book [31] to eight other programming languages: C++, C#, Javascript, Kotlin, Pascal, Python, R, and Scala. These code files cover seven introductory materials: the output, input, branching, looping, function, array, and matrix.…”

Section: Evaluation Datasets (mentioning)

confidence: 99%

“…The same-language dataset is the Java dataset used for creating the introductory dataset [28]. Each original code file was copied and modified according to the last six levels of similarity disguise taxonomy [16] (the first level was excluded, as it depicted no disguises) by nine teaching assistants.…”

Section: Evaluation Datasets (mentioning)

confidence: 99%

TF-IDF Inspired Detection for Cross-Language Source Code Plagiarism and Collusion

2020

csci

Self Cite

Several computing courses allow students to choose which programming language they use to complete a programming task. This can lead to cross-language code plagiarism and collusion, in which the copied code file is rewritten in another programming language. In response, this paper proposes a detection technique that can accurately compare code files written in various programming languages while requiring limited effort to accommodate such languages at the development stage. The only language-dependent feature used in the technique is the source code tokeniser, and no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF inspired weighting, in which rare matches are prioritised. Our evaluation shows that the technique outperforms common techniques in academia for handling language conversion disguises. Further, it is comparable to those techniques when dealing with conventional disguises.
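The idea of prioritising rare matches can be illustrated with a minimal Python sketch. This is not the paper's implementation: the regex tokeniser, the token n-gram size, the smoothed IDF formula, and the weighted Jaccard-style score (all names below are hypothetical) are assumptions chosen only to show how fragments shared by many submissions can contribute less to a similarity score than fragments shared by few.

```python
import math
import re
from collections import Counter


def tokenise(code):
    # Hypothetical language-agnostic tokeniser (an assumption, not the paper's):
    # identifiers/keywords, integer literals, and single symbol characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def ngrams(tokens, n=3):
    # Set of contiguous token n-grams; n=3 is an arbitrary illustrative choice.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def idf_weights(docs):
    # docs: list of feature sets, one per submission.
    # Features found in fewer submissions receive a larger weight.
    total = len(docs)
    df = Counter(feature for features in docs for feature in features)
    return {f: math.log(total / count) + 1.0 for f, count in df.items()}


def weighted_similarity(a, b, weights):
    # Weighted Jaccard-style score: shared weight over combined weight.
    shared = sum(weights[f] for f in a & b)
    union = sum(weights[f] for f in a | b)
    return shared / union if union else 0.0
```

In this sketch, two submissions that share only fragments common to the whole class score lower than they would under an unweighted overlap measure, which is one way coincidental similarity could be dampened.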

“…Both techniques of raising initial suspicion require lecturer's knowledge about student academic performance. We cannot therefore use publicly available data sets [21], [22], [23], [24] as they have no such information. We create our own data set just for this purpose.…”

Section: Introduction (mentioning)

confidence: 99%

Initial Suspicion on Detecting Code Plagiarism and Collusion in Academia: Case Study of Algorithm and Data Structure Courses

Ayub

Wijanto

et al. 2021

JITeCS

Self Cite

In engineering education, some assessments require students to submit program code, and since that code might be the result of plagiarism or collusion, a similarity detection tool is often used to filter excessively similar programs. To improve the scalability of such a tool, it has been suggested to place initial suspicion on some programs and compare only those programs to the others (instead of exhaustively comparing all programs with one another). This paper compares the effectiveness of two common techniques for raising such initial suspicion: focusing on the submissions of smart students (as they are likely to be copied), or on the submissions of slow-paced students (since those students are likely to breach academic integrity to get a higher assessment mark). Our study shows that the latter statistically outperforms the former by 13% in terms of precision; slow-paced students are likely to be the perpetrators, but they fail to get the submissions of smart students.
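The scalability argument can be made concrete with a small Python sketch. It is an illustration under assumptions, not the study's procedure: the function names and the example student identifiers are hypothetical, and how the suspect set is chosen (smart or slow-paced students) is left outside the code.

```python
from itertools import combinations


def exhaustive_pairs(ids):
    # Every submission is compared with every other submission.
    return list(combinations(sorted(ids), 2))


def suspicion_pairs(ids, suspects):
    # Only pairs that involve at least one initially suspected submission.
    return [
        (a, b)
        for a, b in combinations(sorted(ids), 2)
        if a in suspects or b in suspects
    ]


# Hypothetical class of five submissions with two initial suspects.
ids = ["s1", "s2", "s3", "s4", "s5"]
suspects = {"s2", "s5"}
all_pairs = exhaustive_pairs(ids)
reduced_pairs = suspicion_pairs(ids, suspects)
```

With five submissions and two suspects, the exhaustive approach needs 10 comparisons while the suspicion-based approach needs 7; the gap widens as the class grows and the suspect set stays small.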