2019 22nd International Conference on Computer and Information Technology (ICCIT)
DOI: 10.1109/iccit48885.2019.9038560
Authorship Attribution in Bangla literature using Character-level CNN

Abstract: Characters are the smallest units of text from which stylometric signals can be extracted to determine the author of a text. In this paper, we investigate the effectiveness of character-level signals for Authorship Attribution of Bangla Literature and show that the results are promising but improvable. The proposed model is far more time- and memory-efficient than its word-level counterparts, but its accuracy is 2–5% lower than the best-performing word-level models. A comparison of various word-based models is performed …
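The pipeline the abstract describes — representing a text as a sequence of characters and extracting features with a convolutional layer — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the vocabulary builder, filter count, and filter width are illustrative choices, and a real character-level CNN would learn its filters by training rather than drawing them at random.

```python
import random

def build_vocab(texts):
    """Map each character in the corpus to an integer id (0 is reserved for padding/unknown)."""
    chars = sorted({c for t in texts for c in t})
    return {c: i + 1 for i, c in enumerate(chars)}

def encode(text, vocab, max_len):
    """Character ids, truncated or zero-padded to max_len."""
    ids = [vocab.get(c, 0) for c in text[:max_len]]
    return ids + [0] * (max_len - len(ids))

def char_cnn_features(ids, vocab_size, n_filters=4, width=3, seed=0):
    """One 1-D convolution over one-hot characters, ReLU, then global max-pooling."""
    rng = random.Random(seed)
    dim = vocab_size + 1  # +1 for the padding/unknown id
    # Random filters stand in for learned weights in this sketch.
    filters = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(width)]
               for _ in range(n_filters)]
    feats = []
    for f in filters:
        best = 0.0  # ReLU floor: negative activations pool to 0
        for t in range(len(ids) - width + 1):
            # Dot product of the filter with one-hot characters reduces to a table lookup.
            act = sum(f[k][ids[t + k]] for k in range(width))
            best = max(best, act)
        feats.append(best)
    return feats

# Usage with short Bangla samples (illustrative data, not the paper's corpus):
texts = ["আমি বই পড়ি", "তুমি গান গাও"]
vocab = build_vocab(texts)
ids = encode(texts[0], vocab, max_len=16)
features = char_cnn_features(ids, len(vocab))
```

The resulting fixed-length feature vector would feed a classifier over candidate authors; max-pooling is what makes the representation insensitive to where in the passage a characteristic character n-gram occurs.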

Cited by 13 publications (19 citation statements)
References 31 publications
“…They built a corpus of 3125 passages and achieved the highest accuracy with a random forest classifier (96%), compared with Naive Bayes (62%) and decision tree (85%) classifiers. Khatun et al. [6] introduced a character-level CNN for Bengali authorship attribution. This system's performance decreased as the number of authors and sample texts increased.…”
Section: B. Bengali Language Based Authorship Classification
confidence: 99%
“…The layer-wise weight values are stored in the metafile. To investigate authorship classification performance, LSTM [19], character-level CNN [6], SVM [10], SGD [50], multilingual pre-trained BERT (M-BERT) [51], and DistilBERT [52] classifiers are also implemented on the same datasets.…”
Section: Ex(author
confidence: 99%