$CAG$: Stylometric Authorship Attribution of Multi-Author Documents Using a Co-Authorship Graph

Sarwar, Raheem; Urailertprasert, Norawit; Vannaboot, Nattapol; Yu, Chenyun; Rakthanmanon, Thanawin; Chuangsuwanich, Ekapol; Nutanong, Sarana

doi:10.1109/access.2020.2967449

Cited by 18 publications

(7 citation statements)

References 52 publications

(113 reference statements)

“…Sarwar et al. [23] developed a multi-authorship classification system that achieved 76.92% accuracy for 1,360 text documents. Their classification system depends on co-author information.…”

Section: A Non-Bengali Language-based Authorship Classification (mentioning)

confidence: 99%

Authorship Classification in a Resource Constraint Language Using Convolutional Neural Networks

et al. 2021

Authorship classification is the task of automatically determining the author of an unknown linguistic text. Although research on authorship classification has progressed significantly in high-resource languages, it is at a primitive stage for resource-constrained languages like Bengali. This paper presents an authorship classification system based on Convolutional Neural Networks (CNN) comprising four modules: embedding model generation, feature representation, classifier training and classifier testing. For this purpose, this work develops a new embedding corpus (named WEC) and a Bengali authorship classification corpus (called BACC-18), which are more robust in terms of authors' classes and unique words. Using three text embedding techniques (Word2Vec, GloVe and FastText) and combinations of different hyperparameters, 90 embedding models are created in this study. All the embedding models are assessed by intrinsic evaluators, and the 9 best-performing models out of the 90 are selected for authorship classification. In total, 36 classification models, combining four classifiers (CNN, LSTM, SVM, SGD) with three embedding techniques at 100, 200 and 250 embedding dimensions, are trained with optimized hyperparameters and tested on three benchmark datasets (BACC-18, BAAD16 and LD). Among the models, the optimized CNN with GloVe achieved the highest classification accuracies of 93.45%, 95.02%, and 98.67% for the datasets BACC-18, BAAD16, and LD, respectively.

Index terms: Natural language processing, authorship classification, resource-constrained language, semantic feature extraction, deep learning.

“…• Type-token ratio: the ratio of the total number of unique tokens to the total number of tokens: $\mathrm{uniq}(N_{i,\mathrm{tokens}}) / N_{i,\mathrm{tokens}}$ (11), where $N_{i,\mathrm{tokens}}$ and $\mathrm{uniq}(N_{i,\mathrm{tokens}})$ are the total number of tokens and the total number of unique tokens in text $x_{i,d}$, respectively. A token is a general term that could refer, for example, to a word, a number, or a punctuation mark.…”

Section: A Vocabulary Richness (mentioning)

confidence: 99%
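
The type-token ratio quoted above can be illustrated with a minimal Python sketch. The tokenizer here is an assumption made for illustration only: it treats each run of word characters (words and numbers) and each single punctuation mark as one token, and it is case-sensitive. The cited survey may tokenize differently, and the function names `tokenize` and `type_token_ratio` are hypothetical.

```python
import re


def tokenize(text):
    # Assumed tokenizer: a token is a run of word characters (a word or a
    # number) or a single punctuation mark. Whitespace is discarded.
    return re.findall(r"\w+|[^\w\s]", text)


def type_token_ratio(text):
    # Number of unique tokens divided by the total number of tokens.
    tokens = tokenize(text)
    if not tokens:
        # Assumed convention for empty input, to avoid division by zero.
        return 0.0
    return len(set(tokens)) / len(tokens)


# "the" occurs twice, so there are 5 unique tokens out of 6 in total.
ratio = type_token_ratio("the cat saw the dog .")
```

Because the ratio tends to fall as a text gets longer, comparisons are most meaningful between texts of similar length.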

“…In this case, $x_{i,d}[g]$ denotes the shape of the $g$th word in text $x_{i,d}$. Example grams: [11] = "sss".…”

Section: B Classical N-grams (mentioning)

confidence: 99%

“…Fundamentally, electronic text stylometry problems aim at inferring information about authors of input electronic texts. Such inferred information could be the identity of the authors, their genders, age groups, personality types, or even the diagnosis of specific illnesses [6], [7], [11]-[15]. A common taxonomy of electronic text stylometry problem solvers that is often followed by the literature is as follows:…”

Section: Introduction (mentioning)

confidence: 99%

Authorship Identification of Electronic Texts

2021

Electronic text stylometry is concerned with analyzing the writing styles of input electronic texts to extract information about their authors. For example, such extracted data could be the authors' identity or other aspects, such as their gender and age group. This survey paper presents the following contributions: 1) A description of all stylometry problems in probability terms, under a unified notation. 2) A survey of data representation (or feature extraction) methods. 3) A comprehensive evaluation of 23,760 feature extraction methods followed by a thorough discussion of the results. This extensive evaluation is critical since the known data representation methods are often not evaluated under the same unified testbed.

“…Consequently, a huge amount of UGC (user-generated-content) such as blog posts, product reviews, articles and novels is continuously being generated by the non-native writers [9,30]. Therefore, performing NLI with UGC can be useful in several areas such as forensic linguistics, author profiling and authorship identification [9,18,29,30,34,37,38]. For example, in the context of the forensic linguistics, a juncture where the linguistic stylistics and the legal system intersect [23], NLI can be considered as a useful tool to provide evidence regarding the linguistic background of an author.…”

Section: Introduction (mentioning)

confidence: 99%

Native Language Identification of Fluent and Advanced Non-Native Writers

Sarwar, Rutherford, Hassan, et al. 2020

ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Self Cite

Native Language Identification (NLI) aims at identifying the native languages of authors by analyzing their text samples written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific or social-network features and may not be generalizable to other domains and datasets, (ii) are unable to capture the variations of the language-usage patterns within a text sample, and (iii) are not associated with any outlier handling mechanism. Moreover, since there is a sizable number of people who have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on our feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of the language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply the probabilistic k-nearest-neighbors classifier on the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora where each corpus is written in a different language, namely, English, French, and German. Our experimental studies show that our solution outperforms competitive methods and reports more than 80% accuracy across languages.
