Authorship attribution with thousands of candidate authors

Koppel, Moshe; Schler, Jonathan; Argamon, Shlomo; Messeri, Eran

doi:10.1145/1148170.1148304

Cited by 67 publications

(49 citation statements)

References 4 publications

“…Beyond literature, several evaluation corpora for authorship attribution studies have been built covering certain text domains such as online newspaper articles (Stamatatos, et al, 2000; Diederich, et al, 2003; Luyckx & Daelemans, 2005; Sanderson & Guenter, 2006), e-mail messages (de Vel, et al, 2001; Koppel & Schler, 2003), online forum messages (Argamon, et al, 2003; Abbasi & Chen, 2005; Zheng, et al, 2006), newswire stories (Khmelev & Teahan, 2003a; Zhao & Zobel, 2005), blogs (Koppel, Schler, Argamon, & Messeri, 2006), etc. Alternatively, corpora built for other purposes have also been used in the framework of authorship attribution studies including parts of the Reuters-21578 corpus (Teahan & Harper, 2003; Marton, et al, 2005), the Reuters Corpus Volume 1 (Khmelev & Teahan, 2003a; Madigan, et al, 2005; Stamatatos, 2007) and the TREC corpus (Zhao & Zobel, 2005) that were initially built for evaluating thematic text categorization tasks.…”

Section: Discussion (mentioning)

confidence: 99%

“…Emphasis is now given to the objective evaluation of the proposed methods as well as the comparison of different methods based on common benchmark corpora (Juola, 2004). In addition, factors playing a crucial role in the accuracy of the produced models are examined, such as the training text size (Marton, Wu, & Hellerstein, 2005; Hirst & Feiguina, 2007), the number of candidate authors (Koppel, Schler, Argamon, & Messeri, 2006), and the distribution of training texts over the candidate authors (Stamatatos, 2008).…”

Section: Introduction (mentioning)

confidence: 99%

A survey of modern authorship attribution methods

Stamatatos

2008

J. Am. Soc. Inf. Sci.

1,060

822

Authorship attribution supported by statistical or computational methods has a long history starting from 19th century and marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed Federalist Papers. During the last decade, this scientific field has been developed substantially taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing. The plethora of available electronic texts (e.g., e-mail messages, online forum messages, blogs, source code, etc.) indicates a wide variety of applications of this technology provided it is able to handle short and noisy text from multiple candidate authors. In this paper, a survey of recent advances of the automated approaches to attributing authorship is presented examining their characteristics for both text representation and text classification. The focus of this survey is on computational requirements and settings rather than linguistic or literary issues. We also discuss evaluation methodologies and criteria for authorship attribution studies and list open questions that will attract future work in this area.

“…We use meta-learning to identify such cases and find that in the remaining cases, where the system believes attribution is reliable, we are able to provide highly accurate results. The discussion is Section 7 is an expansion of that given in Koppel et al (2006c).…”

Section: Variations On the Basic Attribution Problem (mentioning)

confidence: 99%

Computational methods in authorship attribution

Koppel

Schler

Argamon

2008

J. Am. Soc. Inf. Sci.

Self Cite

452

278

Statistical authorship attribution has a long history, culminating in the use of modern machine learning classification methods. Nevertheless, most of this work suffers from the limitation of assuming a small closed set of candidate authors and essentially unlimited training text for each. Real-life authorship attribution problems, however, typically fall short of this ideal. Thus, following detailed discussion of previous work, three scenarios are considered here for which solutions to the basic attribution problem are inadequate. In the first variant, the profiling problem, there is no candidate set at all; in this case, the challenge is to provide as much demographic or psychological information as possible about the author. In the second variant, the needle-in-a-haystack problem, there are many thousands of candidates for each of whom we might have a very limited writing sample. In the third variant, the verification problem, there is no closed candidate set but there is one suspect; in this case, the challenge is to determine if the suspect is or is not the author. For each variant, it is shown how machine learning methods can be adapted to handle the special challenges of that variant.

“…Experiments were conducted with support vector machine classifiers in twenty novels and success rates above 90% were obtained. The use of functional words is a valid and good approach in attribution of authorship [Koppel 2006]. A success rate of 65% and 72% has been measured in the study for authorship recognition, which is an implementation of multiple regression and discriminant analysis [Stamatatos et al, 2000].…”

Section: Related Work (mentioning)

confidence: 99%

Detection of Fraudulent Emails by Authorship Extraction

Pandian

Karim

2012

IJCA

Fraudulent emails can be detected by extraction of authorship information from the contents of emails. This paper presents information extraction based on unique words from the emails. These unique words will be used as representative features to train Radial Basis function (RBF). Final weights are obtained and subsequently used for testing. The percentage of identification of email authorship depends upon number of RBF centers and the type of functional words used for training RBF. One hundred and fifty authors with over one hundred files from the sent folder of Enron email dataset are considered. A total of 300 unique words of number of characters in each word ranging from three to seven are considered. Training and testing of RBF are done by taking different lengths of words. Our simulation shows the effectiveness of the proposed RBF network for email authorship identification. The accuracy of authorship identification ranges from 95% to 97%.
