Research in automatic text plagiarism detection focuses on algorithms that compare suspicious documents against a collection of reference documents. Recent approaches perform well in identifying copied or modified foreign sections, but they assume a closed world in which a reference collection is given. This article investigates the question of whether plagiarism can be detected by a computer program if no reference can be provided, e.g., if the foreign sections stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism analysis; it is closely related to the problem of authorship verification. Our contributions are threefold. (1) We organize the algorithmic building blocks for intrinsic plagiarism analysis and authorship verification and survey the state of the art. (2) We show how the meta learning approach of Koppel and Schler, termed "unmasking", can be employed to post-process unreliable stylometric analysis results. (3) We operationalize and evaluate an analysis chain that combines document chunking, style model computation, one-class classification, and meta learning.
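To make the first three stages of such an analysis chain concrete, the following is a minimal sketch and not the implementation evaluated in this article. It assumes a deliberately tiny style model (average word length plus the relative frequencies of a hypothetical handful of function words) and replaces a proper one-class classifier with a simple z-score rule: the document's own chunks define the target style, and a chunk is flagged when its features deviate strongly from the document-wide mean. All function names, the feature set, the chunk size, and the threshold are illustrative assumptions; the meta learning stage (unmasking) is not covered.

```python
import re
import statistics

# Hypothetical, deliberately small function-word list (an assumption of this sketch).
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "it")


def chunk(text, size=50):
    """Document chunking: consecutive chunks of `size` words (the last may be shorter)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def style_vector(words):
    """Style model of one chunk: average word length and function-word frequencies."""
    n = len(words)
    avg_len = sum(len(w) for w in words) / n
    return [avg_len] + [words.count(fw) / n for fw in FUNCTION_WORDS]


def outlier_chunks(text, size=50, z_threshold=2.0):
    """Return indices of chunks whose style deviates from the rest of the document.

    Stand-in for one-class classification: a chunk is flagged when its mean
    absolute z-score, taken over all features that vary within the document,
    exceeds `z_threshold`.
    """
    chunks = chunk(text, size)
    if not chunks:
        return []
    vectors = [style_vector(c) for c in chunks]
    columns = list(zip(*vectors))  # one tuple of values per feature
    means = [statistics.fmean(col) for col in columns]
    spreads = [statistics.pstdev(col) for col in columns]

    flagged = []
    for idx, vec in enumerate(vectors):
        # Features that are constant across the document carry no signal; skip them.
        z = [abs(v - m) / s for v, m, s in zip(vec, means, spreads) if s > 1e-12]
        score = sum(z) / len(z) if z else 0.0
        if score > z_threshold:
            flagged.append(idx)
    return flagged
```

Calling `outlier_chunks(text)` on a document yields the positions of the chunks that look stylistically foreign under these assumptions. With few chunks or short chunks such a rule is unreliable, which is the kind of result a post-processing step would have to filter.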
Problem statement

In the following, the term plagiarism refers to text plagiarism, i.e., the use of another author's information, language, or writing without proper acknowledgment of the original source. Plagiarism detection refers to the unveiling of text plagiarism. Existing approaches to computer-based plagiarism detection break this task down into manageable parts: "Given a text $d$ and a reference collection $D$, does $d$ contain a section $s$ for which one can find a document $d_i \in D$ that contains a section $s_i$ such that under some retrieval model $R$ the similarity $\varphi_R$ between $s$ and $s_i$ is above a threshold $\theta$?" Observe that research on automated plagiarism detection presumes a closed world in which a reference collection $D$ is given. Since $D$ can be extremely large (possibly the entire indexed part of the World Wide Web), the main research focus is on efficient search technology: near-similarity search and near-duplicate detection (Brin et al