2020
DOI: 10.1007/978-3-030-49161-1_22

Cross-Domain Authorship Attribution Using Pre-trained Language Models

Cited by 44 publications (42 citation statements): 2 supporting, 40 mentioning, 0 contrasting
References 16 publications
“…Author Profile (AP) BERT and RoBERTa (Barlas and Stamatatos, 2020, 2021). We trained a separate neural language model for each author in the dataset, where the embedding layer is initialized with embeddings from BERT and RoBERTa.…”
Section: Pretrained Language Models (mentioning)
confidence: 99%
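The statement quoted above describes training one language model per candidate author, with the embedding layer seeded from pretrained transformer embeddings. The sketch below illustrates that idea in Python; it is an assumption-laden illustration rather than the cited authors' code, and the checkpoint name (`bert-base-uncased`), the `AuthorLM` class, and the LSTM backbone are all illustrative choices.

```python
# Minimal sketch of a per-author language model whose embedding layer is
# initialized from pretrained BERT input embeddings (illustrative only).
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")


class AuthorLM(nn.Module):
    """Small recurrent LM trained on a single author's texts."""

    def __init__(self, bert_model, hidden_size=256):
        super().__init__()
        pretrained = bert_model.get_input_embeddings().weight  # (vocab_size, 768)
        vocab_size, emb_dim = pretrained.shape
        # Embedding layer initialized with a copy of BERT's input embeddings.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.embed.weight.data.copy_(pretrained.detach())
        self.rnn = nn.LSTM(emb_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids):
        hidden, _ = self.rnn(self.embed(token_ids))
        return self.out(hidden)  # next-token logits


# One model per candidate author; each would be fit on that author's documents only.
author_models = {name: AuthorLM(bert) for name in ["author_a", "author_b"]}
ids = tokenizer("a short text by author_a", return_tensors="pt")["input_ids"]
logits = author_models["author_a"](ids)
```

At attribution time, per-author models of this kind are typically compared by how well each predicts the disputed text (e.g., per-token perplexity), with the best-fitting author's model taken as the prediction.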
“…Since the first computational approach to authorship attribution (Mosteller and Wallace, 1963), researchers have aimed at finding new sets of features for current domains/languages, adapting existing features to new languages or communication domains, or using new classification techniques, e.g. (Abbasi and Chen, 2006; Stamatatos, 2013; Silva et al., 2011; Layton et al., 2012; Iqbal et al., 2013; Zhang et al., 2018; Altakrori et al., 2018; Barlas and Stamatatos, 2020). Alternatively, motivated by the real-life applications of authorship attribution, different elements of and constraints on the attribution process have been investigated (Houvardas and Stamatatos, 2006; Luyckx and Daelemans, 2011; Goldstein-Stewart et al., 2009; Stamatatos, 2013; Wang et al., 2021).…”
Section: Introduction (mentioning)
confidence: 99%
“…We use a pretrained SBERT model (Reimers and Gurevych, 2019) but update all model parameters during training. Prior work has explored self-attention models for authorship attribution (Saedi and Dras, 2020; Fabien et al., 2020; Barlas and Stamatatos, 2020) with mixed success compared to simpler convolutional models. These systems have utilized either the output or the classification token of BERT as the basis for learning authorship embeddings.…”
Section: Model (mentioning)
confidence: 99%
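The quote above mentions fine-tuning a pretrained SBERT encoder, with all parameters updated, to learn authorship embeddings. Below is a small, hedged sketch of what such fine-tuning could look like with the sentence-transformers library; the checkpoint name, the toy pair data, and the cosine-similarity objective are assumptions made for illustration, not necessarily the cited paper's exact setup.

```python
# Illustrative sketch: fine-tune a pretrained SBERT encoder so that texts by
# the same author map to nearby embeddings; all parameters are updated.
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder SBERT checkpoint

# Toy pairs: label 1.0 = same author, 0.0 = different authors (assumed data).
train_examples = [
    InputExample(texts=["a text by author A ...", "another text by author A ..."], label=1.0),
    InputExample(texts=["a text by author A ...", "a text by author B ..."], label=0.0),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)  # pulls same-author pairs together

# No layers are frozen, so every encoder parameter is updated during training.
model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=10)

# After training, model.encode(text) yields an embedding usable for authorship
# verification or attribution via nearest-neighbour comparison.
embedding = model.encode("an unseen disputed text")
```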
“…On the other hand, if authorship features could be learned in a domain-independent fashion, it would reduce the need for in-domain training sets by exploiting transfer between domains: authorship representations could be learned from a large but out-of-domain corpus and subsequently deployed in a target domain. In prior work, Barlas and Stamatatos (2020) perform a study on cross-domain author verification in a closed world of 21 authors. In contrast, we consider an open-world setting with several orders of magnitude more authors.…”
Section: Introduction (mentioning)
confidence: 99%
“…Of course, in real-life scenarios, authors differ in topic and genre (e.g., documents, e-mails, tweets), but the main challenge is to focus on the author's stylometry [25,26]. Meanwhile, information from the cross-topic or cross-genre setting could mislead the model [25], which makes authorship verification difficult [27].…”
Section: Collecting and Preprocessing the Data (mentioning)
confidence: 99%