2023
DOI: 10.48550/arxiv.2301.12500
Preprint

BERT-based Authorship Attribution on the Romanian Dataset called ROST

Abstract: Although it has been around for decades, the problem of Authorship Attribution is still very much in focus today. Among the more recent instruments used are pre-trained language models, the most prevalent being BERT. Here we used such a model to detect the authorship of texts written in the Romanian language. The dataset used is highly unbalanced, i.e., there are significant differences in the number of texts per author, the sources from which the texts were collected, the time period in which the authors lived and wrote these…
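The abstract highlights that the ROST dataset is highly unbalanced, with very different numbers of texts per author. A common generic mitigation when fine-tuning a classifier on such data is to weight each author class by inverse frequency in the training loss; the sketch below illustrates that weighting scheme only, and is not a technique confirmed by the abstract.

```python
from collections import Counter

def class_weights(author_labels):
    """Inverse-frequency weights for an unbalanced author distribution.

    Each author's weight is total_texts / (num_authors * texts_for_author),
    so under-represented authors contribute more to the training loss.
    """
    counts = Counter(author_labels)
    total = len(author_labels)
    return {a: total / (len(counts) * n) for a, n in counts.items()}

# Toy unbalanced corpus: author A has 6 texts, author B only 2.
labels = ["A"] * 6 + ["B"] * 2
print(class_weights(labels))  # author B receives a larger weight than A
```

Such a weight dictionary could then be passed, for example, to a weighted cross-entropy loss during fine-tuning, though the exact training setup used in the paper is not described in the visible abstract.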


Cited by 1 publication (2 citation statements)
References 17 publications
“…Focusing on datasets with 10 authors (see Table 12), we observe that our RoBERT model outperformed the existing approaches [22,51] for both FT and PP corpora. Additionally, our hybrid RoBERT model, which incorporates RBI features, achieved the highest F1 score of 0.95 for the PP corpus, indicating the effectiveness of leveraging both textual and numerical features for AA tasks.…”
Section: Comparison With Existing Methods
confidence: 89%
“…A second study by Avram [51] focused on authorship attribution using pre-trained language models, particularly BERT, to detect the authorship of Romanian texts. Similar to the previous study, this research used the same dataset, which is highly unbalanced in terms of the number of texts per author, source, time period, and writing type.…”
Section: Authorship Attribution In Romanian
confidence: 99%