Authorship identification of documents with high content similarity

Rexha, Andi; Kroll, Mark W.; Ziak, Hermann; Kern, Roman

doi:10.1007/s11192-018-2661-6

Cited by 35 publications

(22 citation statements)

References 14 publications

(12 reference statements)


“…The assumption is that documents or texts clustered together are more likely to be written by the same author. Rexha, Kröll, Ziak, and Kern (2018) explain that authorship recognition can be done using document clustering where the author of a disputed or controversial text can be identified from a set of candidate authors. Theodoridis and Koutroubas (2003) suggest that text clustering is one of the most primitive mental activities of humans.…”

Section: Methodology Methods (mentioning)

confidence: 99%

Towards a Stylometric Authorship Recognition Model for the Social Media Texts in Arabic

Alsager

2020

AWEJ


Numerous studies have been concerned with developing new authorship recognition systems to address the increasing rates of cybercrime associated with the anonymous nature of social media platforms, which still allow users to conceal their true identities. Nevertheless, it is still challenging to identify the real authors of offensive and inappropriate content on social media. Such content is usually very short; therefore, it is challenging for stylometric authorship systems to assign controversial texts to their real authors based on the salient and distinctive linguistic features and patterns within them. This research introduces a new stylometric authorship system that considers both the shortness of the data and the peculiar linguistic properties of Arabic. The corpus consists of 20,357 tweets from 134 Twitter users. Document clustering based on the Document Index Graph (DIG) model was used to classify input patterns in tweets that shared common linguistic features. A comparative analysis was carried out using a Vector Space Clustering (VSC) model based on the Bag of Words (BOW) model, which is conventionally used in authorship recognition applications. Results indicate that the proposed system is more accurate than other standard authorship systems, which are mainly based on vector space clustering methods. It was also clear that the model had the advantage of providing complete information about the documents and the degree of overlap between every pair of documents, which was useful in determining the similarity between documents.
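To make the baseline mentioned in this abstract concrete, the following is a minimal sketch of how a bag-of-words vector space model can score the similarity between every pair of short texts. It illustrates only the generic BOW baseline, not the DIG model and not the cited authors' implementation; whitespace tokenization and the function names `bag_of_words`, `cosine_similarity`, and `pairwise_similarity` are assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens (assumed tokenization)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors; 0.0 if either is empty."""
    # Counter returns 0 for terms missing from b, so no key check is needed.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(docs: list[str]) -> list[list[float]]:
    """Similarity matrix over all document pairs, as a clustering step would consume it."""
    vectors = [bag_of_words(d) for d in docs]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]
```

Because word order is discarded, two texts that use the same words in different phrases receive the same score as two identical texts, which is one reason a phrase-aware model may be preferred for very short documents.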


“…The text analysis is field with different topic as the linguistic [206], [207], [208], the stylometry [209], and text classification [210].…”

Section: Miscellaneous (mentioning)

confidence: 99%

Interpol review of questioned documents 2016–2019

Deviterne-Lapeyre

2020

Forensic Science International: Synergy


“…Along with the automation of processes in every field, the meaning of stylometry has shifted, and it is defined by Halvani [1] as "the branch of science that determines the authorship of written works through statistical analysis and machine learning". Stylometric analysis is widely applied in more complex computational applications such as authorship identification by Rexha et al. [2], author attribution and diarization by Stamatatos et al. [3], or Intrinsic Plagiarism Detection (IPD) by Rexha [1] and Kuznetsov et al. [4].…”

Section: Introduction (unclassified)

“…In addition, most IPD systems use several features at once rather than relying on a single feature. Other frequently used stylometric features are word-length frequency [2], [12], sentence-length frequency [12], [13], part-of-speech (POS) tag frequency [12], type-token ratio [2], [13], and word-specificity frequency [12].…”

Section: Introduction (unclassified)

Analysis of Stylometric Features and Segmentation Strategies in Intrinsic Plagiarism Detection System

Gunawan

Krisnawati

Chrismanto

2020

RESTI


Two different paradigms in the field of plagiarism detection have resulted in External Plagiarism Detection (EPD) and Intrinsic Plagiarism Detection (IPD) systems. The most commonly applied system is EPD, which requires its algorithm to make a heuristic comparison between a suspicious document and the documents in a corpus. In contrast, given only a suspicious document, an IPD algorithm should be able to find the plagiarized sections by looking for text segments that differ in writing style. Previous research on Indonesian texts has addressed only the development of EPD systems. Therefore, this research focuses on experimenting with and analyzing stylometric features and segmentation strategies to build an IPD system for Indonesian texts. The experimental results show that paragraph-level segmentation performs better, scoring 0.92 for macro-averaged accuracy and 0.54 for macro-averaged F1. The stylometric features achieving the highest F1 and accuracy scores are the frequency of punctuation, the average paragraph length, and the type-token ratio.
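As an illustration of the kind of per-segment measurement this abstract describes, here is a minimal sketch that splits a document into paragraph segments and computes three simple stylometric features for each one. It is not the cited authors' implementation: the blank-line segmentation rule, the regex tokenizer, the use of ASCII punctuation only, and the names `split_paragraphs` and `stylometric_features` are all assumptions made for illustration.

```python
import re
import string


def split_paragraphs(document: str) -> list[str]:
    """Split on blank lines (assumed segmentation rule) and drop empty segments."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def stylometric_features(segment: str) -> dict[str, float]:
    """Compute three simple style features for one text segment."""
    words = re.findall(r"\w+", segment.lower())
    # Counts ASCII punctuation only; other scripts would need a wider set.
    punct = sum(1 for ch in segment if ch in string.punctuation)
    return {
        # Punctuation marks per character of the segment.
        "punctuation_freq": punct / len(segment) if segment else 0.0,
        # Segment length measured in word tokens.
        "length_in_words": float(len(words)),
        # Distinct word forms divided by total word tokens.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }
```

An intrinsic detector could then compare each segment's feature values against those of the rest of the document and flag segments that deviate strongly, although the thresholding or classification step is outside the scope of this sketch.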
