Proceedings of the 22nd International Conference on Computational Linguistics - COLING '08 2008
DOI: 10.3115/1599081.1599146
Authorship attribution and verification with many authors and limited data

Abstract: Most studies in statistical or machine learning based authorship attribution focus on two or a few authors. This leads to an overestimation of the importance of the features extracted from the training data and found to be discriminating for these small sets of authors. Most studies also use sizes of training data that are unrealistic for situations in which stylometry is applied (e.g., forensics), and thereby overestimate the accuracy of their approach in these situations. A more realistic interpretation of t…

Cited by 94 publications (81 citation statements) | References 22 publications
“…We know from state-of-the-art research in AA that the length of the documents and the number of potential candidate authors have an important effect on the accuracy of AA approaches (Moore, 2001; Luyckx and Daelemans, 2008; Luyckx and Daelemans, 2010). We can also point out the most common features that have been used successfully in AA work, including: bag-of-words (Madigan et al., 2005; Stamatatos, 2006), stylistic features (Zheng et al., 2006; Stamatatos et al., 2000), and word- and character-level n-grams (Kjell et al., 1994; Peng et al., 2003; Juola, 2006).…”
Section: Introduction (mentioning, confidence: 99%)
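The character-level n-grams mentioned in the excerpt above are among the simplest AA features to compute. A minimal illustrative sketch (not the cited papers' exact implementation, just the standard sliding-window construction):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams with a sliding window."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def feature_vector(text, n=3):
    """Count character n-gram frequencies, a common AA feature set.

    Spaces and punctuation are kept inside the grams on purpose:
    they capture habits like spacing and word boundaries.
    """
    return Counter(char_ngrams(text.lower(), n))

vec = feature_vector("the cat sat")
# "the cat sat" (11 chars) yields 9 overlapping trigrams,
# e.g. "the", "he ", "e c", ..., "sat"
```

In practice such vectors are fed to a classifier (e.g. an SVM) over a large candidate-author set; the point of the sketch is only the feature construction.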
“…A variety of performance measures have been used in previous work on this task, including false acceptance and false rejection rates [60,17], accuracy [25,26], recall, precision, F1 [30], balanced error rate [19], recall-precision graphs [26], macro-averaged precision and recall [1], and ROC graphs [22]. Unfortunately, these measures are not able to explicitly estimate the ability of an approach to leave problems unanswered, a fact which is crucial in a cost-sensitive task like this.…”
Section: Related Work (mentioning, confidence: 99%)
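One measure that does account for unanswered problems is c@1 (used in later PAN shared tasks): an unanswered problem is credited at the system's observed accuracy rate instead of being counted as an error. A minimal sketch:

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1 score: correct answers plus unanswered problems
    credited at the system's accuracy rate (n_correct / n_total)."""
    return (n_correct + n_correct * n_unanswered / n_total) / n_total

# Answering 60 of 100 problems correctly and leaving 20 unanswered:
score = c_at_1(60, 20, 100)  # (60 + 60 * 20/100) / 100 = 0.72
```

With plain accuracy the same system would score 0.60; c@1 rewards declining the 20 hard cases instead of guessing on them.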
“…Previous work on author verification has been evaluated using sample texts in one language only (Greek [60], Dutch [17,30], English [25,26]) and a specific genre (newspaper articles [60], student essays [30], fiction [25], newswire stories [19], poems [19], blogs [26]). Author verification was also included in previous editions of PAN: the author identification task at PAN-2011 included three author verification problems [1], and PAN-2013 focused on author verification and provided corpora in English, Greek, and Spanish [22].…”
Section: Related Work (mentioning, confidence: 99%)
“…Most of the work on AV has focused on developing specific features (stylometric, lexical, character-level, syntactic, semantic) able to characterize the writing style of authors, thus putting the emphasis on feature extraction and selection [7,11,10,1]; see [13] for a comprehensive review. However, although these features can be helpful for obtaining reliable models, extracting them from raw text is a rather complex and time-consuming process.…”
Section: Related Work (mentioning, confidence: 99%)
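To make the excerpt's point concrete, here is an illustrative sketch of a few shallow stylometric features of the kind AV systems hand-engineer. The feature choices are assumptions for illustration, not the features of any cited system:

```python
import re

def stylometric_features(text):
    """Compute a few shallow stylometric features (illustrative only;
    real AV systems combine many more, including syntactic ones)."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words)
    return {
        # average word length: crude proxy for lexical complexity
        "avg_word_len": sum(map(len, words)) / n if n else 0.0,
        # type-token ratio: vocabulary richness
        "type_token_ratio": len(set(words)) / n if n else 0.0,
        # punctuation habit: commas per character
        "comma_rate": text.count(",") / max(len(text), 1),
    }

feats = stylometric_features("The quick, brown fox jumps over the lazy dog.")
```

Even this toy extractor shows why the excerpt calls the process complex: every feature needs its own tokenization and normalization decisions, and deeper syntactic or semantic features require full NLP pipelines.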