Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06)
DOI: 10.3115/1610075.1610142

Short text authorship attribution via sequence kernels, Markov chains and author unmasking

Abstract: We present an investigation of recently proposed character and word sequence kernels for the task of authorship attribution based on relatively short texts. Performance is compared with two corresponding probabilistic approaches based on Markov chains. Several configurations of the sequence kernels are studied on a relatively large dataset (50 authors), where each author covered several topics. Utilising Moffat smoothing, the two probabilistic approaches obtain similar performance, which in turn is comparable …
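As background, the sequence kernels investigated in the paper compare documents through shared character or word subsequences. A common instance is the character n-gram spectrum kernel; the Python sketch below shows that form for illustration only. The function names and the choice to sum orders 1 through max_n = 4 (echoing the "up to 4-grams" result quoted in the citation statements below) are assumptions, not the paper's exact configuration.

    from collections import Counter

    def char_ngrams(text, n):
        # Count every contiguous character n-gram in the string.
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def spectrum_kernel(a, b, max_n=4):
        # Sum of dot products between n-gram count vectors, n = 1..max_n.
        total = 0
        for n in range(1, max_n + 1):
            ca, cb = char_ngrams(a, n), char_ngrams(b, n)
            total += sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
        return total

    print(spectrum_kernel("the cat sat on the mat", "the cat ran"))

In practice such kernel values are fed to a kernel classifier such as an SVM, usually after normalisation so that longer texts do not dominate.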

Cited by 110 publications (90 citation statements) | References 18 publications
“…This would involve anything from simple routines, like conversion to lowercase, to more complex tools, like stemmers (Sanderson & Guenter, 2006), lemmatizers (Tambouratzis, Markantonatou, Hairetakis, Vassiliou, Carayannis, & Tambouratzis, 2004; Gamon, 2004), or detectors of common homographic forms (Burrows, 2002). Another procedure, used by van Halteren (2007), is to transform words into an abstract form.…”
Section: Lexical Features
confidence: 99%
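As an illustration of the kind of lexical preprocessing this statement describes, the sketch below lowercases text and stems each token. NLTK's PorterStemmer is used as a stand-in for the stemmers cited above, and normalise_tokens is a hypothetical helper name, not something from the cited works.

    # Requires: pip install nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def normalise_tokens(text):
        # Lowercase, split on whitespace, and stem each token.
        return [stemmer.stem(tok) for tok in text.lower().split()]

    print(normalise_tokens("The runners were running quickly"))
    # -> ['the', 'runner', 'were', 'run', 'quickli']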
“…The problem of defining a fixed value for n can be avoided by the extraction of n-grams of variable length (Forsyth & Holmes, 1996; Houvardas & Stamatatos, 2006). Sanderson and Guenter (2006) described the use of several sequence kernels based on character n-grams of variable length, and the best results for short English texts were achieved when examining sequences of up to 4-grams. Moreover, various Markov models of variable order have been proposed for handling character-level information (Khmelev & Teahan, 2003a; Marton et al., 2005).…”
Section: Character Features
confidence: 99%
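A minimal sketch of the variable-length idea quoted above: pool character n-grams of every length up to 4 into a single frequency profile, so that no single n has to be fixed in advance. The function name and the profile size top_k are illustrative assumptions, not values from the cited works.

    from collections import Counter

    def variable_ngram_profile(text, max_n=4, top_k=500):
        # Pool character n-grams of every length 1..max_n into one
        # frequency profile, keeping only the top_k most frequent.
        counts = Counter()
        for n in range(1, max_n + 1):
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
        return dict(counts.most_common(top_k))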
“…This may be accounted for by using a probabilistic distance measure such as K-L divergence between Markov model probability distributions of the texts (Juola, 1998; Khmelev, 2001; Khmelev and Tweedie, 2002; Juola & Baayen, 2003; Sanderson and Guenter, 2006), possibly implicitly in the context of compression methods (Kukushkina et al., 2001; Benedetto et al., 2002; Khmelev and Teahan, 2003; Marton et al., 2005).…”
Section: Multivariate Analysis Approach
confidence: 99%
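The sketch below illustrates one such probabilistic distance: K-L divergence between two first-order character Markov chains. Add-alpha smoothing and the weighting of each state's row divergence by its empirical frequency in the first text are simplifying assumptions; the works cited above (and the paper's own Moffat smoothing) differ in these details.

    import math
    from collections import Counter

    def transition_probs(text, alphabet, alpha=0.5):
        # Add-alpha smoothed first-order (bigram) character Markov model.
        pairs = Counter(zip(text, text[1:]))
        firsts = Counter(text[:-1])
        v = len(alphabet)
        return {(a, b): (pairs[(a, b)] + alpha) / (firsts[a] + alpha * v)
                for a in alphabet for b in alphabet}

    def markov_kl(text_p, text_q):
        # D(P || Q): each state's row divergence, weighted by that
        # state's empirical frequency in text_p (a simplifying choice).
        alphabet = sorted(set(text_p) | set(text_q))
        p = transition_probs(text_p, alphabet)
        q = transition_probs(text_q, alphabet)
        firsts = Counter(text_p[:-1])
        total = sum(firsts.values())
        return sum((firsts[a] / total) * p[a, b] * math.log(p[a, b] / q[a, b])
                   for a in alphabet for b in alphabet if firsts[a])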
“…Finally, one limitation of unmasking that should be noted is that it requires a large amount of training text (Sanderson and Guenter, 2006); preliminary tests suggest that the minimum would be in the area of 5,000 to 10,000 words.…”
Section: i-th Highest Accuracy Drop in Two Iterations
confidence: 99%
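For orientation, here is a rough sketch of the unmasking procedure this remark refers to: repeatedly cross-validate a linear classifier on chunks of the two texts, then remove the most discriminative features and measure accuracy again; curves for same-author pairs tend to degrade fastest. The scikit-learn implementation and all hyperparameters below are illustrative assumptions, not the original method's settings.

    # Requires: pip install scikit-learn numpy
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def unmasking_curve(chunks_a, chunks_b, iterations=10, drop_per_side=3):
        # chunks_a / chunks_b: lists of text chunks from the two documents.
        vec = CountVectorizer(max_features=250)  # most frequent words
        X = vec.fit_transform(chunks_a + chunks_b).toarray().astype(float)
        y = np.array([0] * len(chunks_a) + [1] * len(chunks_b))
        keep = np.arange(X.shape[1])
        curve = []
        for _ in range(iterations):
            clf = LinearSVC(dual=False)
            # Record cross-validated accuracy on the surviving features.
            curve.append(cross_val_score(clf, X[:, keep], y, cv=5).mean())
            clf.fit(X[:, keep], y)
            w = clf.coef_[0]
            # Drop the most discriminative features from each side.
            order = np.argsort(w)
            drop = np.concatenate([order[:drop_per_side],
                                   order[-drop_per_side:]])
            keep = np.delete(keep, drop)
        return curve  # fast degradation suggests same authorship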
“…He concludes that the best method uses many short text samples for minority classes and fewer but longer ones for the majority classes. [18] observed that the amount of training material has more influence on performance than the amount of test material. In order to obtain reliable performance, they find that 5,000 words of training text can be considered a minimum requirement.…”
Section: Literature Survey
confidence: 99%