Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06)
DOI: 10.3115/1610075.1610142

Short text authorship attribution via sequence kernels, Markov chains and author unmasking

Abstract: We present an investigation of recently proposed character and word sequence kernels for the task of authorship attribution based on relatively short texts. Performance is compared with two corresponding probabilistic approaches based on Markov chains. Several configurations of the sequence kernels are studied on a relatively large dataset (50 authors), where each author covered several topics. Utilising Moffat smoothing, the two probabilistic approaches obtain similar performance, which in turn is comparable …
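As background, the sequence kernels investigated in the paper compare documents through shared character or word subsequences. A common instance is the character n-gram spectrum kernel; the Python sketch below shows that form for illustration only. The function names and the choice to sum orders 1 through max_n = 4 (echoing the "up to 4-grams" result quoted in the citation statements below) are assumptions, not the paper's exact configuration.

    from collections import Counter

    def char_ngrams(text, n):
        # Count every contiguous character n-gram in the string.
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def spectrum_kernel(a, b, max_n=4):
        # Sum of dot products between n-gram count vectors, n = 1..max_n.
        total = 0
        for n in range(1, max_n + 1):
            ca, cb = char_ngrams(a, n), char_ngrams(b, n)
            total += sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
        return total

    print(spectrum_kernel("the cat sat on the mat", "the cat ran"))

In practice such kernel values are fed to a kernel classifier such as an SVM, usually after normalisation so that longer texts do not dominate.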

Cited by 110 publications (90 citation statements) | References 18 publications
“…This would involve anything from simple routines, like conversion to lowercase, to more complex tools, like stemmers (Sanderson & Guenter, 2006), lemmatizers (Tambouratzis, Markantonatou, Hairetakis, Vassiliou, Carayannis, & Tambouratzis, 2004; Gamon, 2004), or detectors of common homographic forms (Burrows, 2002). Another procedure, used by van Halteren (2007), is to transform words into an abstract form.…”
Section: Lexical Features
confidence: 99%
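As an illustration of the kind of lexical preprocessing this statement describes, the sketch below lowercases text and stems each token. NLTK's PorterStemmer is used as a stand-in for the stemmers cited above, and normalise_tokens is a hypothetical helper name, not something from the cited works.

    # Requires: pip install nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def normalise_tokens(text):
        # Lowercase, split on whitespace, and stem each token.
        return [stemmer.stem(tok) for tok in text.lower().split()]

    print(normalise_tokens("The runners were running quickly"))
    # -> ['the', 'runner', 'were', 'run', 'quickli']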
“…The problem of defining a fixed value for n can be avoided by the extraction of n-grams of variable length (Forsyth & Holmes, 1996; Houvardas & Stamatatos, 2006). Sanderson and Guenter (2006) described the use of several sequence kernels based on character n-grams of variable length, and the best results for short English texts were achieved when examining sequences of up to 4-grams. Moreover, various Markov models of variable order have been proposed for handling character-level information (Khmelev & Teahan, 2003a; Marton et al., 2005).…”
Section: Character Features
confidence: 99%
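A minimal sketch of the variable-length idea quoted above: pool character n-grams of every length up to 4 into a single frequency profile, so that no single n has to be fixed in advance. The function name and the profile size top_k are illustrative assumptions, not values from the cited works.

    from collections import Counter

    def variable_ngram_profile(text, max_n=4, top_k=500):
        # Pool character n-grams of every length 1..max_n into one
        # frequency profile, keeping only the top_k most frequent.
        counts = Counter()
        for n in range(1, max_n + 1):
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
        return dict(counts.most_common(top_k))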
“…This may be accounted for by using a probabilistic distance measure such as K-L divergence between Markov model probability distributions of the texts (Juola, 1998; Khmelev, 2001; Khmelev and Tweedie, 2002; Juola & Baayen, 2003; Sanderson and Guenter, 2006), possibly implicitly in the context of compression methods (Kukushkina et al., 2001; Benedetto et al., 2002; Khmelev and Teahan, 2003; Marton et al., 2005).…”
Section: Multivariate Analysis Approach
confidence: 99%
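The sketch below illustrates one such probabilistic distance: K-L divergence between two first-order character Markov chains. Add-alpha smoothing and the weighting of each state's row divergence by its empirical frequency in the first text are simplifying assumptions; the works cited above (and the paper's own Moffat smoothing) differ in these details.

    import math
    from collections import Counter

    def transition_probs(text, alphabet, alpha=0.5):
        # Add-alpha smoothed first-order (bigram) character Markov model.
        pairs = Counter(zip(text, text[1:]))
        firsts = Counter(text[:-1])
        v = len(alphabet)
        return {(a, b): (pairs[(a, b)] + alpha) / (firsts[a] + alpha * v)
                for a in alphabet for b in alphabet}

    def markov_kl(text_p, text_q):
        # D(P || Q): each state's row divergence, weighted by that
        # state's empirical frequency in text_p (a simplifying choice).
        alphabet = sorted(set(text_p) | set(text_q))
        p = transition_probs(text_p, alphabet)
        q = transition_probs(text_q, alphabet)
        firsts = Counter(text_p[:-1])
        total = sum(firsts.values())
        return sum((firsts[a] / total) * p[a, b] * math.log(p[a, b] / q[a, b])
                   for a in alphabet for b in alphabet if firsts[a])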
“…Finally, one limitation of unmasking that should be noted is that it requires a large amount of training text (Sanderson and Guenter, 2006); preliminary tests suggest that the minimum would be in the area of 5,000 to 10,000 words.…”
Section: i-th Highest Accuracy Drop in Two Iterations
confidence: 99%
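For orientation, here is a rough sketch of the unmasking procedure this remark refers to: repeatedly cross-validate a linear classifier on chunks of the two texts, then remove the most discriminative features and measure accuracy again; curves for same-author pairs tend to degrade fastest. The scikit-learn implementation and all hyperparameters below are illustrative assumptions, not the original method's settings.

    # Requires: pip install scikit-learn numpy
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def unmasking_curve(chunks_a, chunks_b, iterations=10, drop_per_side=3):
        # chunks_a / chunks_b: lists of text chunks from the two documents.
        vec = CountVectorizer(max_features=250)  # most frequent words
        X = vec.fit_transform(chunks_a + chunks_b).toarray().astype(float)
        y = np.array([0] * len(chunks_a) + [1] * len(chunks_b))
        keep = np.arange(X.shape[1])
        curve = []
        for _ in range(iterations):
            clf = LinearSVC(dual=False)
            # Record cross-validated accuracy on the surviving features.
            curve.append(cross_val_score(clf, X[:, keep], y, cv=5).mean())
            clf.fit(X[:, keep], y)
            w = clf.coef_[0]
            # Drop the most discriminative features from each side.
            order = np.argsort(w)
            drop = np.concatenate([order[:drop_per_side],
                                   order[-drop_per_side:]])
            keep = np.delete(keep, drop)
        return curve  # fast degradation suggests same authorship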
“…He concludes that the best method uses many short text samples for minority classes and fewer but longer ones for the majority classes. [18] observed that the amount of training material has more influence on performance than the amount of test material. In order to obtain reliable performance, they find that 5,000 words of training text can be considered a minimum requirement.…”
Section: Literature Survey
confidence: 99%