Intrinsic Plagiarism Detection using N-gram Classes

Bensalem, Imene; Rosso, Paolo; Chikhi, Salim

doi:10.3115/v1/d14-1153

Cited by 20 publications

(27 citation statements)

References 10 publications


“…Before starting the analysis, let us recall that Stamatatos' method is a well-known IPD method and we provided a brief description of it in Section 2.2. As for our method, it was first introduced in the short paper (Bensalem et al 2014), and we will provide a detailed description of it in the next section.…”

Section: Discussion

mentioning

confidence: 99%

“…One of the most straightforward text representation approaches used in IPD methods is character n-grams. Some methods use them alone (Bensalem et al 2014; Kestemont et al 2011; Stamatatos 2009a), while others include additional features (Kern et al 2012; Kuznetsov et al 2016; Rao et al 2011; Stein et al 2011). Character n-grams are known to be a powerful and effective text representation in style analysis-based tasks such as authorship attribution (Kešelj et al 2003; Stamatatos 2016) and authorship verification (Brocardo et al 2013; Jankowska et al 2014).…”

mentioning

confidence: 99%

“…In other words, to try to describe plagiarism in terms of character n-grams by considering their frequency ranges (frequent or infrequent). We conduct our investigation using two character n-gram-based methods: our method (Bensalem et al 2014) that we will describe in this paper, and the well-known IPD method of Stamatatos (2009a).…”

mentioning

confidence: 99%


On the use of character n-grams as the only intrinsic evidence of plagiarism

Bensalem

Rosso

Chikhi

2019

Lang Resources & Evaluation

Self Cite


When a shift in writing style is noticed in a document, doubts arise about its originality. Based on this clue to plagiarism, the intrinsic approach to plagiarism detection identifies the stolen passages by analysing the writing style of the suspicious document without comparing it to textual resources that may serve as sources for the plagiarist. Character n-grams are recognised as a successful approach to modelling text for writing style analysis. Although prior studies have investigated the best practice of using character n-grams in authorship attribution and other problems, there is still a need for such investigations in the context of intrinsic plagiarism detection. Moreover, it has been assumed in previous works that the ways of using character n-grams in authorship attribution remain the same for intrinsic plagiarism detection. In this paper, we study the effect of character n-gram frequency and length on the performance of intrinsic plagiarism detection. Our experiments utilise two state-of-the-art methods and five large document collections of PAN labs written in English and Arabic. We demonstrate empirically that the low- and the high-frequency n-grams are not equally relevant for intrinsic plagiarism detection, but their performance depends on the way they are exploited.

Keywords: Intrinsic plagiarism detection, Character n-grams, Stylistic features, Writing style analysis

We are very grateful to the anonymous reviewers for their insightful suggestions and constructive comments that greatly improved the paper.

“…Bensalem introduced a new intrinsic plagiarism detection method for language, based on a new text representation in n-gram classes, that is, a classification of n-gram occurrences. For example, the class level of the most frequent occurrences, the most frequent occurrence class, and the intermediate occurrence class [8]. Palkovskii combined all the previous research results from PAN12 and PAN13 and improved the previously developed plagiarism detection method with the help of contextual n-grams, surrounding-context n-grams, entity-based n-grams, and others [9].…”

Section: Introduction

unclassified

Analisa Perbandingan Jenis N-GRAM Dalam Penentuan Similarity Pada Deteksi Plagiat

Pratama

Utami

Arief

2019

citec


Easy access to information has made plagiarism increasingly prevalent. It can be deterred with a plagiarism detection system. Such a system can be built on the concept of similarity, with the Rabin-Karp algorithm for string matching and n-grams as the parsing method. Earlier studies that used both techniques reported fairly good results for plagiarism detection. Studies from abroad have done similar work on plagiarism detection and produced new findings such as cross-language similarity. There are also new findings on plagiarism detection from various testing approaches and from combining existing methods to improve detection results. Our goal in this study is to compare parsing methods to find out which one gives the fastest results while staying within a reasonable accuracy. As the control for our accuracy measure we use Plagiarism Checker X Free: the accuracy of our test instrument is determined from the difference between the similarity reported by that application and the similarity reported by our instrument. We found that word n-grams have the most optimal accuracy compared with the other n-gram types while still being relatively the fastest.

Keywords: comparison, n-gram, text similarity, plagiarism detection
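The abstract above names Rabin-Karp string matching over n-grams as the basis of its similarity measure but gives no implementation details. The following is a minimal sketch of that general idea, not the authors' system: it fingerprints every k-character window with a Rabin-Karp style rolling hash and compares two texts with the Dice coefficient. The function names, the window length `k`, the hash parameters `base` and `mod`, and the choice of the Dice coefficient are all assumptions made for illustration.

```python
def rolling_hashes(text, k=5, base=256, mod=1_000_003):
    """Return the set of rolling-hash fingerprints of all k-character windows.

    Hypothetical helper: base and mod are illustrative values, not taken
    from the cited paper.
    """
    if len(text) < k:
        return set()
    high = pow(base, k - 1, mod)  # weight of the character that leaves the window
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = {h}
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % mod  # drop the oldest character
        h = (h * base + ord(text[i])) % mod      # append the newest character
        hashes.add(h)
    return hashes


def dice_similarity(a, b, k=5):
    """Dice coefficient over the two fingerprint sets (assumed similarity measure)."""
    ha, hb = rolling_hashes(a, k), rolling_hashes(b, k)
    if not ha and not hb:
        return 0.0
    return 2 * len(ha & hb) / (len(ha) + len(hb))
```

Because the fingerprints are taken modulo `mod`, two different windows can occasionally collide, so a production system would normally confirm hash matches by comparing the underlying strings. Swapping the character windows for word windows would give the word n-gram variant that the abstract reports as the best trade-off.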


“…In [17], an n-gram class is explained as a number from 0 to m−1 such that the class labeled 0 involves the least frequent n-grams and the class labeled m−1 contains the most frequent n-grams in a document. If m > 2, classes between 0 and m−1 will contain n-grams with intermediate frequency levels.…”

Section: Methodology and System Framework

mentioning

confidence: 99%
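The statement above fixes only the labelling convention: class 0 holds the least frequent n-grams, class m−1 the most frequent, and intermediate classes exist when m > 2. It does not say how frequencies are mapped to classes. As a rough illustration, the sketch below assigns classes with a logarithmic binning rule that is purely an assumption here: each halving of an n-gram's frequency relative to the most frequent n-gram lowers its class by one, floored at 0. The function names and default values of `n` and `m` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_classes(text, n=3, m=3):
    """Map each distinct character n-gram to a class label in 0..m-1.

    Assumed rule (not confirmed by the quoted statement): the class drops by
    one for every halving of frequency relative to the most frequent n-gram.
    """
    counts = Counter(char_ngrams(text, n))
    if not counts:
        return {}
    max_freq = max(counts.values())
    classes = {}
    for gram, freq in counts.items():
        # Integer floor(log2(max_freq / freq)), computed without floating point
        drop = (max_freq // freq).bit_length() - 1
        classes[gram] = max(0, m - 1 - drop)
    return classes
```

With `n=1` and `m=3`, a text containing eight `a`, four `b`, and one `c` puts `a` in class 2, `b` in class 1, and `c` in class 0, which matches the ordering described in the statement. Any other monotone binning, such as rank-based quantiles, would respect the same convention.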

Plagiarism Detection System for the Kurdish Language

Wakil

Ghafoor

Abdulrahman

et al. 2017

IJITCS


Plagiarism is a serious issue, especially in education. Detecting plagiarism has become a challenging task, particularly in natural language texts. In past years, some plagiarism detection tools have been developed for diverse natural languages, mainly English. Language-independent tools exist as well but are considered too restrictive, as they usually do not consider specific language features. The problem is that there is no plagiarism detection system for the Kurdish language. In this paper, we introduce a new plagiarism detection system for the Kurdish language based on an n-gram algorithm; our system can detect words, phrases, and paragraphs. Moreover, our system is effective at detecting plagiarised texts both on localhost and online, especially via the Google search engine. This system is most useful for academic organizations such as schools, institutes, and universities for finding texts copied from another document.
