Jonathan Schler scite author profile

Computational methods in authorship attribution

Statistical authorship attribution has a long history, culminating in the use of modern machine learning classification methods. Nevertheless, most of this work suffers from the limitation of assuming a small closed set of candidate authors and essentially unlimited training text for each. Real-life authorship attribution problems, however, typically fall short of this ideal. Thus, following detailed discussion of previous work, three scenarios are considered here for which solutions to the basic attribution problem are inadequate. In the first variant, the profiling problem, there is no candidate set at all; in this case, the challenge is to provide as much demographic or psychological information as possible about the author. In the second variant, the needle-in-a-haystack problem, there are many thousands of candidates, for each of whom we might have a very limited writing sample. In the third variant, the verification problem, there is no closed candidate set but there is one suspect; in this case, the challenge is to determine if the suspect is or is not the author. For each variant, it is shown how machine learning methods can be adapted to handle the special challenges of that variant.

Automatically profiling the author of an anonymous text

Argamon et al. 2009

Imagine that you have been given an important text of unknown authorship, and wish to know as much as possible about the unknown author (demographics, personality, cultural background, etc.), just by analyzing the given text. This authorship profiling problem is of growing importance in the current global information environment; applications abound in forensics, security, and commercial settings. For example, authorship profiling can help police identify characteristics of the perpetrator of a crime when there are too few (or too many) specific suspects to consider. Similarly, large corporations may be interested in knowing what types of people like or dislike their products, based on analysis of blogs and online product reviews. The question we therefore ask is: How much can we discern about the author of a text simply by analyzing the text itself? It turns out that, with varying degrees of accuracy, we can say a great deal indeed. Unlike the problem of authorship attribution (determining the author of a text from a given candidate set), discussed recently in these pages by Li, Zheng, and Chen (2006), authorship profiling does not begin with a set of writing samples from known candidate authors. Instead, we exploit the sociolinguistic observation that different groups of people speaking or writing in a particular genre and in a particular language use that language differently (Chambers et al. 2004). That is, they vary in how often they use certain words or syntactic constructions (in addition to variation in, e.g., pronunciation or intonation). The particular profile dimensions we consider here are author gender (Argamon et al. 2003), age (Koppel et al. 2006), native language

Authorship verification as a one-class classification problem

2004

In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the "depth of difference" between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process.
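
The degradation test described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the features are relative frequencies of the most common words, a nearest-centroid classifier stands in for the learned model, leave-one-out accuracy stands in for whatever evaluation the paper uses, and the "best" features are taken to be those whose class means differ most. The names `unmasking_curve`, `n_features`, `drop_per_round` and `rounds` are hypothetical.

```python
from collections import Counter


def word_freqs(chunk, vocab):
    """Relative frequency of each vocabulary word in one text chunk."""
    tokens = chunk.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in vocab]


def centroid(vectors, keep):
    """Mean vector over the feature indices in `keep`."""
    return {j: sum(v[j] for v in vectors) / len(vectors) for j in keep}


def loo_accuracy(a_vecs, x_vecs, keep):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumes each side has at least two chunks, so that a class is never
    empty after one chunk is held out.
    """
    data = [(v, 0) for v in a_vecs] + [(v, 1) for v in x_vecs]
    correct = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        cents = [centroid([v for v, l in rest if l == cls], keep) for cls in (0, 1)]
        dists = [sum((vec[j] - c[j]) ** 2 for j in keep) for c in cents]
        pred = 0 if dists[0] <= dists[1] else 1
        correct += pred == label
    return correct / len(data)


def unmasking_curve(a_chunks, x_chunks, n_features=50, drop_per_round=4, rounds=5):
    """Accuracy after each round of dropping the most discriminative features."""
    totals = Counter(w for c in a_chunks + x_chunks for w in c.lower().split())
    vocab = [w for w, _ in totals.most_common(n_features)]
    a_vecs = [word_freqs(c, vocab) for c in a_chunks]
    x_vecs = [word_freqs(c, vocab) for c in x_chunks]
    keep = set(range(len(vocab)))
    curve = []
    for _ in range(rounds):
        curve.append(loo_accuracy(a_vecs, x_vecs, keep))
        ca, cx = centroid(a_vecs, keep), centroid(x_vecs, keep)
        # Rank remaining features by how far apart the two class means are.
        ranked = sorted(keep, key=lambda j: abs(ca[j] - cx[j]), reverse=True)
        keep -= set(ranked[:drop_per_round])
    return curve


# Toy chunks, invented purely for illustration.
known = ["alpha alpha beta gamma", "alpha beta beta gamma", "alpha alpha beta beta gamma"]
disputed = ["delta delta epsilon gamma", "delta epsilon epsilon gamma",
            "delta delta epsilon epsilon gamma"]
print(unmasking_curve(known, disputed, drop_per_round=2, rounds=3))
```

The returned list is the accuracy curve; following the abstract, the quantity of interest is how quickly it degrades as features are removed, not the accuracy of any single round.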

Determining an author's native language by mining a text for errors

2005

Authorship attribution in the wild

Koppel, Schler, Argamon. 2010. Lang Resources & Evaluation

187

121

Most previous work on authorship attribution has focused on the case in which we need to attribute an anonymous document to one of a small set of candidate authors. In this paper, we consider authorship attribution as found in the wild: the set of known candidates is extremely large (possibly many thousands) and might not even include the actual author. Moreover, the known texts and the anonymous texts might be of limited length. We show that even in these difficult cases, we can use similarity-based methods along with multiple randomized feature sets to achieve high precision. Moreover, we show the precise relationship between attribution precision and four parameters: the size of the candidate set, the quantity of known text by the candidates, the length of the anonymous text, and a certain robustness score associated with an attribution.
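
One way to picture "similarity-based methods along with multiple randomized feature sets" is the sketch below. It is an illustration under assumptions, not the paper's code: character 4-grams as features, cosine similarity, a 40% random feature subset per trial, and a vote-share cutoff are all choices made here for the example, and the names `attribute`, `trials`, `fraction` and `threshold` are hypothetical. The share of trials won by the top candidate plays the role of a robustness score, and a low share yields no attribution, which covers the case where the true author is not in the candidate set.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=4):
    """Counts of overlapping character n-grams (an assumed feature type)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(u, v, features):
    """Cosine similarity restricted to the given feature subset."""
    dot = sum(u[f] * v[f] for f in features)
    nu = math.sqrt(sum(u[f] ** 2 for f in features))
    nv = math.sqrt(sum(v[f] ** 2 for f in features))
    return dot / (nu * nv) if nu and nv else 0.0


def attribute(anonymous, known, trials=100, fraction=0.4, threshold=0.9, seed=0):
    """Return (author or None, score) for one anonymous text.

    `known` maps each candidate author to a sample of their writing.
    Texts are assumed to be at least four characters long.
    """
    rng = random.Random(seed)
    anon_vec = char_ngrams(anonymous)
    known_vecs = {author: char_ngrams(text) for author, text in known.items()}
    all_features = sorted(set(anon_vec).union(*known_vecs.values()))
    sample_size = max(1, int(fraction * len(all_features)))
    wins = Counter()
    for _ in range(trials):
        # Each trial sees only a random subset of the feature space.
        subset = rng.sample(all_features, sample_size)
        best = max(known_vecs, key=lambda a: cosine(anon_vec, known_vecs[a], subset))
        wins[best] += 1
    author, count = wins.most_common(1)[0]
    score = count / trials
    # Abstain when no candidate wins consistently across feature subsets.
    return (author if score >= threshold else None), score


# Toy candidate set, invented purely for illustration.
candidates = {
    "alice": "the quick brown fox jumps over the lazy dog again and again",
    "bob": "zzzz yyyy xxxx wwww vvvv uuuu tttt ssss rrrr qqqq",
}
print(attribute("the quick brown fox jumps over the lazy dog", candidates))
```

Raising `threshold` trades recall for precision: fewer texts are attributed, but the attributions that remain are the ones that held up across many different feature subsets.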

Mining the Blogosphere: Age, gender and the varieties of self-expression

Argamon, Koppel, Pennebaker et al. 2007

157

118

The growth of the blogosphere offers an unprecedented opportunity to study language and how people use it on a large scale. We present an analysis of over 140 million words of English text drawn from the blogosphere, exploring if and how age and gender affect writing style and topic. Our primary result is that a number of stylistic and content-based indicators are significantly affected by both age and gender, and that the main difference between older and younger bloggers, and between male and female bloggers, lies in the extent to which their discourse is outer- or inner-directed. In fact, the linguistic factors that increase in use with age are just those used more by males of any age, and conversely, those that decrease in use with age are those used more by females of any age.

The Importance of Neutral Examples for Learning Sentiment

Koppel, Schler. 2006. Computational Intelligence

125

Most research on learning to identify sentiment ignores "neutral" examples, learning only from examples of significant (positive or negative) polarity. We show that it is crucial to use neutral examples in learning polarity for a variety of reasons. Learning from negative and positive examples alone will not permit accurate classification of neutral examples. Moreover, the use of neutral training examples in learning facilitates better distinction between positive and negative examples.

Authorship attribution with thousands of candidate authors

Koppel, Schler, Argamon et al. 2006

Computational methods in authorship attribution
