Proceedings of the 22nd International Conference on Computational Linguistics - COLING '08 2008
DOI: 10.3115/1599081.1599146
Authorship attribution and verification with many authors and limited data

Abstract: Most studies in statistical or machine learning based authorship attribution focus on two or a few authors. This leads to an overestimation of the importance of the features extracted from the training data and found to be discriminating for these small sets of authors. Most studies also use sizes of training data that are unrealistic for situations in which stylometry is applied (e.g., forensics), and thereby overestimate the accuracy of their approach in these situations. A more realistic interpretation of t…

Cited by 94 publications (81 citation statements) | References 22 publications
“…We know from state-of-the-art research in AA that the length of the documents and the number of potential candidate authors have an important effect on the accuracy of AA approaches (Moore, 2001; Luyckx and Daelemans, 2008; Luyckx and Daelemans, 2010). We can also point out the most common features that have been used successfully in AA work, including: bag-of-words (Madigan et al., 2005; Stamatatos, 2006), stylistic features (Zheng et al., 2006; Stamatatos et al., 2000), and word- and character-level n-grams (Kjell et al., 1994; Peng et al., 2003; Juola, 2006).…”
Section: Introduction (mentioning, confidence: 99%)
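The character-level n-grams mentioned in the excerpt above are among the simplest AA features to compute. A minimal illustrative sketch (not the cited papers' exact implementation, just the standard sliding-window construction):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams with a sliding window."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def feature_vector(text, n=3):
    """Count character n-gram frequencies, a common AA feature set.

    Spaces and punctuation are kept inside the grams on purpose:
    they capture habits like spacing and word boundaries.
    """
    return Counter(char_ngrams(text.lower(), n))

vec = feature_vector("the cat sat")
# "the cat sat" (11 chars) yields 9 overlapping trigrams,
# e.g. "the", "he ", "e c", ..., "sat"
```

In practice such vectors are fed to a classifier (e.g. an SVM) over a large candidate-author set; the point of the sketch is only the feature construction.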
“…A variety of performance measures have been used in previous work on this task, including false acceptance and false rejection rates [60,17], accuracy [25,26], recall, precision, F1 [30], balanced error rate [19], recall-precision graphs [26], macro-averaged precision and recall [1], and ROC graphs [22]. Unfortunately, these measures are not able to explicitly estimate the ability of an approach to leave problems unanswered, a fact which is crucial in a cost-sensitive task like this.…”
Section: Related Work (mentioning, confidence: 99%)
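One measure that does account for unanswered problems is c@1 (used in later PAN shared tasks): an unanswered problem is credited at the system's observed accuracy rate instead of being counted as an error. A minimal sketch:

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1 score: correct answers plus unanswered problems
    credited at the system's accuracy rate (n_correct / n_total)."""
    return (n_correct + n_correct * n_unanswered / n_total) / n_total

# Answering 60 of 100 problems correctly and leaving 20 unanswered:
score = c_at_1(60, 20, 100)  # (60 + 60 * 20/100) / 100 = 0.72
```

With plain accuracy the same system would score 0.60; c@1 rewards declining the 20 hard cases instead of guessing on them.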
“…Previous work on author verification has been evaluated using sample texts in one language only (Greek [60], Dutch [17,30], English [25,26]) and a specific genre (newspaper articles [60], student essays [30], fiction [25], newswire stories [19], poems [19], blogs [26]). Author verification was also included in previous editions of PAN: the author identification task at PAN-2011 included three author verification problems [1], and PAN-2013 focused on author verification and provided corpora in English, Greek, and Spanish [22].…”
Section: Related Work (mentioning, confidence: 99%)
“…Most of the work on AV has focused on developing specific features (stylometric, lexical, character-level, syntactic, semantic) able to characterize the writing style of authors, thus putting the emphasis on feature extraction and selection [7,11,10,1]; see [13] for a comprehensive review. However, although these features can be helpful for obtaining reliable models, extracting them from raw text is a rather complex and time-consuming process.…”
Section: Related Work (mentioning, confidence: 99%)
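To make the excerpt's point concrete, here is an illustrative sketch of a few shallow stylometric features of the kind AV systems hand-engineer. The feature choices are assumptions for illustration, not the features of any cited system:

```python
import re

def stylometric_features(text):
    """Compute a few shallow stylometric features (illustrative only;
    real AV systems combine many more, including syntactic ones)."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words)
    return {
        # average word length: crude proxy for lexical complexity
        "avg_word_len": sum(map(len, words)) / n if n else 0.0,
        # type-token ratio: vocabulary richness
        "type_token_ratio": len(set(words)) / n if n else 0.0,
        # punctuation habit: commas per character
        "comma_rate": text.count(",") / max(len(text), 1),
    }

feats = stylometric_features("The quick, brown fox jumps over the lazy dog.")
```

Even this toy extractor shows why the excerpt calls the process complex: every feature needs its own tokenization and normalization decisions, and deeper syntactic or semantic features require full NLP pipelines.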