A Novel Document Representation Approach for Authorship Attribution

Mekala, Sreenivas; Tippireddy, Raghunadha; Vardhan, B. Vishnu

doi:10.22266/ijies2018.0630.28

Cited by 5 publications

(3 citation statements)

References 10 publications

“…Combining ML and NLP techniques, such as multinomial naïve Bayes (MNB), support vector machine (SVM), expectation-maximization algorithm (EM), and stop words, lemmatization, and stemming [24], are used to identify fake reviews. Mekala et al [25] demonstrate high precision by utilizing the approach of stylistic characteristics and term weight measurement. Saha et al [26] achieved 96 percent accuracy using MLP on a dataset of social text from social media platforms.…”

Section: Literature Review (mentioning)

confidence: 99%

Policy-Based Spam Detection of Tweets Dataset

et al. 2023

Spam communications from spam ads and social media platforms such as Facebook, Twitter, and Instagram are increasing, making spam detection more popular. Many languages are used for spam review identification, including Chinese, Urdu, Roman Urdu, English, Turkish, etc.; however, there are fewer high-quality datasets available for Urdu. This is mainly because Urdu is less extensively used on social media networks such as Twitter, making it harder to collect huge volumes of relevant data. This paper investigates policy-based Urdu tweet spam detection. This study aims to collect over 1,100,000 real-time tweets from multiple users. The dataset is carefully filtered to comply with Twitter’s 100-tweet-per-hour limit. For data collection, the snscrape library is utilized, which is equipped with an API for accessing various attributes such as username, URL, and tweet content. Then, a machine learning pipeline is developed, consisting of TF-IDF, Count Vectorizer, and the following classifiers: multinomial naïve Bayes, support vector classifier with an RBF kernel, logistic regression, and BERT. Based on Twitter policy standards, feature extraction is performed, and the dataset is separated into training and testing sets for spam analysis. Experimental results show that the logistic regression classifier has achieved the highest accuracy, with an F1-score of 0.70 and an accuracy of 99.55%. The findings of the study show the effectiveness of policy-based spam detection in Urdu tweets using machine learning and BERT layer models and contribute to the development of a robust Urdu language social media spam detection method.

“…The formal study of authorship analysis started in the 19th century, it was first tackled with linguistic approaches and eventually by statistical and computational methods [3]. These tasks continue to grow attention for their practical applications; for example, in a variety of computer crime investigations ranging from homicide to identity theft and many types of financial crimes [4] or in the context of identifying the author of source code [5].…”

Section: Related Work (mentioning)

confidence: 99%

Graph-Based Siamese Network for Authorship Verification

et al. 2022

In this work, we propose a novel approach to solve the authorship identification task in a cross-topic and open-set scenario. Authorship verification is the task of determining whether or not two texts were written by the same author. We model the documents in a graph representation and then a graph neural network extracts relevant features from these graph representations. We present three strategies to represent the texts as graphs based on the co-occurrence of the POS labels of words. We propose a Siamese Network architecture composed of graph convolutional networks along with pooling and classification layers. We present different variants of the architecture and discuss the performance of each one. To evaluate our approach, we used a collection of fanfiction texts provided by the PAN@CLEF 2021 shared task in two settings: a “small” corpus and a “large” corpus. Our graph-based approach achieved average scores (AUC ROC, F1, Brier score, F0.5u, and C@1) between 90% and 92.83% when training on the “small” and “large” corpus, respectively. Our model obtains results comparable to those of the state of the art in this task and better than traditional baselines.

“…Recently, the latter has been combined with other stylometric features. For example, Sapkota et al (2014) used 13 stylometric features: number of sentences, number of tokens per sentence, number of punctuation marks per sentence, and so forth; Mekala et al (2018) extracted 39 stylometric features such as character count, block-letter words, and average sentence length in terms of characters/words; Wu et al (2021) combined four features of statistical style (i.e., average word/sentence length, letter frequency, numbers of 26 letters, and punctuation marks), three content features, two syntactic features, and one semantic feature to predict the author.…”

Section: Sentence Length (mentioning)

confidence: 99%

A review on authorship attribution in text mining

Zheng, Jin 2022

WIREs Computational Stats

The issue of authorship attribution has long been considered and continues to be a popular topic. Because of advances in digital computers, this field has experienced rapid developments in the last decade. In this article, a survey of recent advances in authorship attribution in text mining is presented. This survey focuses on authorship attribution methods that are statistically or computationally supported as opposed to traditional literary approaches. The main aspects covered include the changes in research topics over time, basic feature metrics, machine learning techniques, and the advantages and disadvantages of each approach. Moreover, the corpus size, number of candidates, data imbalance, and result description, all of which pose challenges in authorship attribution, are discussed to inform future work. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Text Mining
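As a rough illustration of the kind of stylometric features named in the citation statement for this entry (character count, block-letter words, average sentence length in characters and words), the following is a minimal Python sketch. It is not the feature extractor of Mekala et al.; the function name `stylometric_features`, the naive sentence splitter, and the reading of "block-letter words" as all-uppercase words are assumptions made only for this example.

```python
import re


def stylometric_features(text: str) -> dict:
    """Compute a few simple stylometric features for one document.

    Hypothetical helper for illustration; covers only 4 features,
    not the 39 mentioned in the citation statement.
    """
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word is a run of ASCII letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    # Assumption: "block-letter words" means all-uppercase words of 2+ characters.
    block_words = [w for w in words if len(w) > 1 and w.isupper()]
    n_sent = max(len(sentences), 1)  # avoid division by zero on empty input
    return {
        "char_count": len(text),
        "block_letter_words": len(block_words),
        "avg_sentence_len_chars": sum(len(s) for s in sentences) / n_sent,
        "avg_sentence_len_words": len(words) / n_sent,
    }


if __name__ == "__main__":
    print(stylometric_features("This is NASA. It is BIG news!"))
```

Feature vectors of this kind are typically computed per document and then passed to a classifier; which classifier and weighting scheme to use is left open here.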

A Novel Document Representation Approach for Authorship Attribution

Abstract: The rapid growth of data on the web results in stolen, unidentified, and fraudulent data. Identifying such data is a prime objective for forensic departments, researchers, and governments.
