2019
DOI: 10.15388/infedu.2019.15

Source Code Plagiarism Detection in Academia with Information Retrieval: Dataset and the Observation

Abstract: Source code plagiarism is an emerging issue in computer science education, and a number of techniques have been proposed to handle it. However, comparing these techniques can be challenging, since each is evaluated on its own private dataset(s). This paper contributes a public dataset for comparing these techniques. Specifically, the dataset is designed for evaluation from an Information Retrieval (IR) perspective. The dataset consists of 467 source code files, covering seven…

Citations: cited by 21 publications (15 citation statements)
References: 42 publications (67 reference statements)
“…Recall is the proportion of suspected program pairs that are copied to all copied pairs. To perform this comparison we used the IR-Plag data set [52]. This data set is based on seven initial Java programs that cover introductory programming materials (output, input, branching, looping, function, array, and matrix).…”
Section: Addressing RQ2: Comparing Strange With JPlag In Recognizing Copied Programs
Citation type: mentioning
confidence: 99%
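
The recall definition quoted above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited paper or from the dataset: the function name `recall`, the treatment of a program pair as an unordered pair of file names, and the file names themselves are all assumptions made for the example.

```python
def recall(suspected_pairs, copied_pairs):
    """Return the share of truly copied pairs that were also flagged as suspected.

    Each pair is treated as unordered, so ("a", "b") equals ("b", "a").
    """
    copied = {frozenset(pair) for pair in copied_pairs}
    if not copied:
        raise ValueError("recall is undefined when there are no copied pairs")
    suspected = {frozenset(pair) for pair in suspected_pairs}
    # Suspected pairs that are genuinely copied, divided by all copied pairs.
    return len(suspected & copied) / len(copied)


# Hypothetical ground truth: four copied program pairs.
copied = [
    ("orig1.java", "copy1.java"),
    ("orig1.java", "copy2.java"),
    ("orig2.java", "copy3.java"),
    ("orig3.java", "copy4.java"),
]

# Hypothetical detector output: two correct flags and one false alarm.
suspected = [
    ("copy1.java", "orig1.java"),
    ("orig2.java", "copy3.java"),
    ("orig4.java", "orig5.java"),
]

print(recall(suspected, copied))
```

Using unordered pairs means a detector is not penalized for reporting a pair in the opposite order from the ground truth. The false alarm does not affect recall; it would lower precision instead.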
“…The introductory dataset was created by rewriting seven original code files in a Java dataset [28] adapted from Liang's book [31] to eight other programming languages: C++, C#, Javascript, Kotlin, Pascal, Python, R, and Scala. These code files cover seven introductory materials: the output, input, branching, looping, function, array, and matrix.…”
Section: Evaluation Datasets
Citation type: mentioning
confidence: 99%
“…The same-language dataset is the Java dataset used for creating the introductory dataset [28]. Each original code file was copied and modified according to the last six levels of similarity disguise taxonomy [16] (the first level was excluded, as it depicted no disguises) by nine teaching assistants.…”
Section: Evaluation Datasets
Citation type: mentioning
confidence: 99%
“…Both techniques of raising initial suspicion require lecturer's knowledge about student academic performance. We cannot therefore use publicly available data sets [21], [22], [23], [24] as they have no such information. We create our own data set just for this purpose.…”
Section: Introduction
Citation type: mentioning
confidence: 99%