Authorship Attribution of Internet Comments with Thousand Candidate Authors

Kapočiūtė-Dzikienė, Jurgita; Utka, Andrius; Šarkutė, Ligita

doi:10.1007/978-3-319-24770-0_37

Cited by 3 publications

(4 citation statements)

References 31 publications


“…They noted that the performance of set of rich linguistic features was better for author prediction when compared with word frequencies and trigrams of characters. Another researchers obtained [7] best results when combination of word based and character tetragrams features are used. In [8], the researchers extracted POS bigrams and trigrams, character trigrams, percentage of direct speech from the documents and syntactic features.…”

Section: Literature Survey (mentioning)

confidence: 98%

A Novel Document Representation Approach for Authorship Attribution

Mekala, Tippireddy, Vardhan

2018

IJIES



“…Despite for the Lithuanian language there are done: 1) lots of descriptive research works (e.g., [14], [15]); 2) some experiments with machine learning (carried out on parliamentary transcripts or forum posts of only 100 candidate authors) [16] or similarity-based approaches (using very limited training data) [17]; these findings do not guarantee the best results for our solving AA task. Our aim is at performing the comparative investigation and at finding the best method, feature type, and feature selection technique for our AA task (with 10, 100, and 1,000 candidate authors) on the corpus of the Lithuanian Internet comments.…”

Section: Related Work (mentioning)

confidence: 99%

“…The SB-RFS technique is adjusted to cope with very concise texts; performs especially well on a small number of features, because the final attribution decision incorporates the generalized results of several decisions obtained during a few iterations. In our experiments we used SB-TopN and SB-RFS implementations presented in [17].…”

Section: Proceedings of the FedCSIS Prague 2017 (mentioning)

confidence: 99%


A Comparison of Authorship Attribution Approaches Applied on the Lithuanian Language

Kapočiūtė-Dzikienė, Venčkauskas, Damaševičius

2017

Proceedings of the 2017 Federated Conference on Computer Science and Information Systems

Self Cite


Abstract: This paper reports comparative authorship attribution results obtained on Internet comments in the morphologically complex Lithuanian language. We have explored the impact of machine learning and similarity-based approaches across different author set sizes (10, 100, and 1,000 candidate authors), feature types (lexical, morphological, and character), and feature selection techniques (feature ranking, random selection). The authorship attribution task was complicated by the characteristics of the Lithuanian language, the non-normative and extremely short texts, and the large number of candidate authors. The best results were achieved with the machine learning approaches. On the larger author sets, the entire feature set composed of word-level character tetra-grams demonstrated the best performance.
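As a rough illustration of the "word-level character tetra-grams" feature type named in the abstract, the sketch below counts 4-character substrings inside each whitespace-separated token. The function name `word_level_char_ngrams`, the lowercasing step, and the choice to keep tokens shorter than four characters whole are assumptions made for illustration; the paper's actual preprocessing is not described here.

```python
from collections import Counter


def word_level_char_ngrams(text, n=4):
    """Count character n-grams taken inside each whitespace-separated token.

    Hypothetical helper: lowercasing and whitespace tokenization are
    assumptions, not the cited paper's documented preprocessing.
    """
    counts = Counter()
    for token in text.lower().split():
        # Assumption: tokens no longer than n are kept whole so that
        # very short words still contribute a feature.
        if len(token) <= n:
            counts[token] += 1
            continue
        # Slide a window of width n over the token; n-grams never cross
        # a word boundary, which is what makes them "word-level".
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


features = word_level_char_ngrams("Labas rytas, Lietuva")
```

The resulting counts could then be fed to any classifier as a sparse feature vector; which classifier and weighting the authors used is not stated in this excerpt.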


Open Class Authorship Attribution of Lithuanian Internet Comments using One-Class Classifier

Venčkauskas, Karpavičius, Damaševičius, et al.

2017

Proceedings of the 2017 Federated Conference on Computer Science and Information Systems

Self Cite


The Internet can be misused by cyber criminals as a platform to conduct illegitimate activities (such as harassment, cyber bullying, and incitement of hate or violence) anonymously. As a result, authorship analysis of anonymous texts on the Internet (such as emails, forum comments) has attracted significant attention in the digital forensic and text mining communities. The main problem is a large number of possible authors, which hinders the effective identification of a true author. We interpret open class author attribution as a process of expert recommendation where the decision support system returns a list of suspected authors for further analysis by forensics experts rather than a single prediction result, thus reducing the scale of the problem. We describe the task formally and present algorithms for constructing the suspected author list. For evaluation we propose using a simple Winner-Takes-All (WTA) metric as well as a set of gain-discount model based metrics from the information retrieval domain (mean reciprocal rank, discounted cumulative gain and rank-biased precision). We also propose the List Precision (LP) metric as an extension of WTA for evaluating the usability of the suspected author list. For experiments, we use our own dataset of Internet comments in the Lithuanian language and consider the use of language-specific (Lithuanian) lexical features together with general lexical features derived from the English language. For classification we use a one-class Support Vector Machine (SVM) classifier. The results of experiments show that the usability of open class author attribution can be improved considerably by using a set of language-specific lexical features together with general lexical features, while the proposed method can be used to reduce the number of suspected authors, thus alleviating the work of forensic linguists.
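The three information-retrieval metrics named in the abstract can be sketched for a ranked list of suspected authors as follows. This is a minimal illustration using the standard textbook definitions with binary relevance (exactly one true author per text); the function names, the persistence parameter `p=0.8`, and the sample data are assumptions, and the paper's own formulations of WTA and List Precision are not reproduced here because the abstract does not define them.

```python
import math


def reciprocal_rank(ranked, true_author):
    """1 / rank of the true author, or 0.0 if the author is not listed."""
    for position, author in enumerate(ranked, start=1):
        if author == true_author:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results):
    """Average reciprocal rank over (ranked_list, true_author) pairs."""
    return sum(reciprocal_rank(r, t) for r, t in results) / len(results)


def dcg(ranked, true_author):
    """Discounted cumulative gain with binary gain (assumption):
    gain 1 at the true author's rank, discounted by log2(rank + 1)."""
    return sum(
        1.0 / math.log2(position + 1)
        for position, author in enumerate(ranked, start=1)
        if author == true_author
    )


def rank_biased_precision(ranked, true_author, p=0.8):
    """Rank-biased precision: (1 - p) * sum of p^(rank - 1) over relevant
    ranks. The persistence value p=0.8 is an illustrative assumption."""
    return (1 - p) * sum(
        p ** (position - 1)
        for position, author in enumerate(ranked, start=1)
        if author == true_author
    )


# Hypothetical suspect lists: each pair is (ranked candidates, true author).
results = [(["a", "b", "c"], "b"), (["c", "a", "b"], "c")]
mrr = mean_reciprocal_rank(results)
```

All three metrics reward placing the true author near the top of the suspect list, but they discount lower ranks at different rates, which is why they are often reported together.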

