Comparison of Cross-Validation and Test Sets Approaches to Evaluation of Classifiers in Authorship Attribution Domain

Baron, Grzegorz

doi:10.1007/978-3-319-47217-1_9

Cited by 17 publications

(3 citation statements)

References 10 publications


“…By using the precompiled Songkorpus, we empirically tested the accuracy of different authorship classificators on a reliable dataset. The results of our best model seems promising and are in accordance with comparable research reports on naïve bayes classifiers (Rish, 2001;Dai et al, 2007;Labatut & Cherifi, 2012;Nitze, Schulthess & Asche, 2012;Altheneyan & Menai, 2014;Baron, 2016;Shih, Stow, & Tsai, 2019). It can be concluded from our experiments that the Naive Bayes classifier seems to be a good choice for authorship attribution of song lyrics, at least for the investigated singer-songwriter dataset.…”

Section: Discussion (supporting)

confidence: 90%

Automatic Authorship Classification for German Lyrics Using Naïve Bayes

Mendhakar

Tilmatine

2023

JLCL


Text classification is a prevalent and essential machine-learning task. Machine learning classifiers have developed immensely since their inception. The naïve Bayes classifier is one of the most prominent supervised machine learning classifiers. In this experiment, we highlight the performance of Naïve Bayes for classifying of authors/artists on the German lyrics corpus (“Songkorpus”) and compare the classification results with other classifier algorithms. The corpus of investigation consists of six artists with 970 songs in total. Bayes model evaluation measures revealed a precision of 0.91, recall of 0.94, and F1-measure of 0.9. Furthermore, the classification performance with other classifier algorithms did not reveal any statistically significant difference in performance. The results of the study add to the high volume of reports on the classification accuracy of Naive Bayes for the task of lyrical classification.



“…Based on this information, together with the nucleotide density information, nucleotide N at the i-th position of sequence S (with length l) can be represented by the formula $N_i = \{x_i, y_i, z_i, d_i\}$ $(i = 1, 2, 3, \ldots, l)$, which satisfies the following equations: [24, 27, 35] We evaluated the performance of these algorithms by an independent testing dataset, since the evaluation by cross-validation may over-estimate the performance of models [39]. The R package caret was used to construct machine learning models, and all parameters were set by default for primitive evaluation. The results are shown in Table 1.…”

Section: Nucleotide Chemical Property (mentioning)

confidence: 99%

m5UPred: A Web Server for the Prediction of RNA 5-Methyluridine Sites from Sequences

Jiang

Tang

Chen

et al. 2020

Molecular Therapy - Nucleic Acids


As one of the widely occurring RNA modifications, 5-methyluridine (m⁵U) has recently been shown to play critical roles in various biological functions and disease pathogenesis, such as under stress response and during breast cancer development. Precise identification of m⁵U sites on RNA is vital for the understanding of the regulatory mechanisms of RNA life. We present here m5UPred, the first web server for in silico identification of m⁵U sites from the primary sequences of RNA. Built upon the support vector machine (SVM) algorithm and the biochemical encoding scheme, m5UPred achieved reasonable prediction performance with the area under the receiver operating characteristic curve (AUC) greater than 0.954 by 5-fold cross-validation and independent testing datasets. To critically test and validate the performance of our newly proposed predictor, the experimentally validated m⁵U sites were further separated by high-throughput sequencing techniques (mi-CLIP-Seq and FICC-Seq) and cell types (HEK293 and HAP1). When tested on cross-technique and cross-cell-type validation using independent datasets, m5UPred achieved an average AUC of 0.922 and 0.926 under mature mRNA mode, respectively, showing reasonable accuracy and reliability. The m5UPred web server is freely accessible now and it should make a useful tool for the researchers who are interested in m⁵U RNA modification.


“…In cross-validation, even with several folds, it is highly probable to obtain falsely higher classification accuracy. These overly optimistic results are explained by this close similarity of some groups of examples [55], and lack of statistical independence between tests, as the same samples are used in several evaluations [11].…”

Section: Plos One (mentioning)

confidence: 99%

Discretisation of conditions in decision rules induced for continuous data

2020

Self Cite


Typically discretisation procedures are implemented as a part of initial pre-processing of data, before knowledge mining is employed. It means that conclusions and observations are based on reduced data, as usually by discretisation some information is discarded. The paper presents a different approach, with taking advantage of discretisation executed after data mining. In the described study firstly decision rules were induced from real-valued features. Secondly, data sets were discretised. Using categories found for attributes, in the third step conditions included in inferred rules were translated into discrete domain. The properties and performance of rule classifiers were tested in the domain of stylometric analysis of texts, where writing styles were defined through quantitative attributes of continuous nature. The performed experiments show that the proposed processing leads to sets of rules with significantly reduced sizes while maintaining quality of predictions, and allows to test many data discretisation methods at the acceptable computational costs.
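
The concern raised in the citation statements above, that cross-validation can report overly optimistic accuracy when groups of closely similar examples are spread across folds, can be illustrated with a toy sketch. This is not the procedure of any of the cited papers: the data, the 1-nearest-neighbour classifier, and all names (`make_samples`, `predict_1nn`, `cv_accuracy`) are assumptions made for illustration. Each hypothetical "source text" contributes four nearly identical samples; the naive split scatters them over the folds, while the grouped split holds out all samples of a source together.

```python
from typing import Callable, List, Tuple

# Hypothetical toy sample: (feature value, class label, id of the source group).
Sample = Tuple[float, int, int]


def make_samples() -> List[Sample]:
    """Eight source groups, each yielding four nearly identical samples."""
    group_labels = [0, 0, 1, 1, 0, 0, 1, 1]  # assumed labels, one per group
    offsets = [0.0, 0.1, 0.2, 0.3]           # tiny within-group variation
    return [(10.0 * g + off, label, g)
            for g, label in enumerate(group_labels)
            for off in offsets]


def predict_1nn(train: List[Sample], x: float) -> int:
    """Return the label of the training sample closest to x."""
    nearest = min(train, key=lambda s: abs(s[0] - x))
    return nearest[1]


def cv_accuracy(samples: List[Sample],
                fold_of: Callable[[int, Sample], int],
                k: int) -> float:
    """k-fold accuracy, where fold_of assigns every sample to a fold."""
    correct = 0
    for fold in range(k):
        train = [s for i, s in enumerate(samples) if fold_of(i, s) != fold]
        test = [s for i, s in enumerate(samples) if fold_of(i, s) == fold]
        correct += sum(predict_1nn(train, x) == y for x, y, _ in test)
    return correct / len(samples)


K = 4
samples = make_samples()

# Naive split: samples of one group land in different folds, so a
# near-duplicate of every test sample is always present in training.
naive_acc = cv_accuracy(samples, lambda i, s: i % K, K)

# Grouped split: all samples of a group share a fold, so the classifier
# is always tested on groups it has never seen.
grouped_acc = cv_accuracy(samples, lambda i, s: s[2] % K, K)

print(f"naive k-fold accuracy:   {naive_acc:.3f}")
print(f"grouped k-fold accuracy: {grouped_acc:.3f}")
```

On this toy data the naive split scores higher than the grouped split, because every test sample has a near-duplicate from its own group in the training folds. The gap comes only from how the folds are drawn, which is the kind of effect an independent test set is meant to avoid.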

