Summary
Plagiarism is an increasingly serious problem in academic environments. In this paper, we deal with a specific kind of plagiarism: source code plagiarism. For this kind, no software is available for detecting plagiarism on a larger scale (hundreds of student submissions every year). We propose algorithms for source code parsing and processing as part of a complex system for plagiarism detection. Source code vectorization using characteristic vectors is a vital piece of the whole process, and the k‐means algorithm helps with the classification and clustering of the vectors. Student assignments are submitted regularly, and any plagiarism detection system needs to handle them as they arrive. For this reason, we propose a modified incremental k‐means algorithm and a method for determining the number of clusters. We also consider methods for vector search among clusters and suggest the use of conditional entropy to select the important vector elements used in the search algorithm. Our results show how the proposed algorithms and methods work on real student submissions.
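To make two of the ideas above concrete, the following is a minimal sketch, not the modified algorithm proposed in the paper. It assumes the textbook sequential (MacQueen-style) k-means update for handling vectors one at a time, and the standard definition of conditional entropy H(cluster | element) over discretized vector elements. The function names `nearest`, `add_vector`, and `conditional_entropy`, and the toy data, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def nearest(centroids, x):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(centroids[i], x)),
    )


def add_vector(centroids, counts, x):
    """Textbook sequential k-means step (an assumption, not the paper's variant):
    assign x to its nearest centroid, then move that centroid toward x
    by 1/n, where n is the updated number of vectors in the cluster."""
    i = nearest(centroids, x)
    counts[i] += 1
    centroids[i] = [c + (v - c) / counts[i] for c, v in zip(centroids[i], x)]
    return i


def conditional_entropy(labels, feature):
    """H(labels | feature) in bits for two equal-length discrete sequences.
    A lower value means the feature says more about the cluster label."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    marginal = Counter(feature)
    return -sum(
        (c / n) * math.log2(c / marginal[f]) for (f, _), c in joint.items()
    )


# Toy usage: two existing clusters, one newly submitted vector.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
assigned = add_vector(centroids, counts, [2.0, 0.0])

# An element that fully determines the cluster has zero conditional entropy.
h_informative = conditional_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
# An element independent of the cluster leaves the full 1 bit of uncertainty.
h_uninformative = conditional_entropy([0, 1, 0, 1], ["a", "a", "b", "b"])
```

Under these assumptions, elements with low conditional entropy would be the natural candidates for the search step, since they narrow down the cluster most; how the paper actually ranks and thresholds elements is not specified in this summary.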