2012
DOI: 10.1080/0013838x.2012.668793
Cross-Genre Authorship Verification Using Unmasking

Abstract: In this paper we will stress-test a recently proposed technique for computational authorship verification, "unmasking", which has been well received in the literature. The technique envisages an experimental setup commonly referred to as "authorship verification", a task generally deemed more difficult than so-called "authorship attribution". We will apply the technique to authorship verification across genres, an extremely complex text categorization problem that so far has remained unexplored. We focus…

Cited by 46 publications (40 citation statements) · References 15 publications
“…Based on another small corpus (2 authors and 3 topics), Madigan et al. (2005) demonstrated that POS features are more effective than word unigrams in cross-topic conditions. The unmasking method for author verification of long documents, based on the frequencies of very frequent words, was successfully tested in cross-topic conditions (Koppel et al., 2007), but Kestemont et al. (2012) found that its reliability was significantly lower in cross-genre conditions. Function words have been found to be effective when topics of the test corpus are excluded from the training corpus (Baayen et al., 2002; Goldstein-Stewart et al., 2009; Menon and Choi, 2011).…”
Section: Related Work · Type: mentioning · Confidence: 99%
“…In most applications, there are certain restrictions that do not allow the construction of a representative training corpus. Unlike other text categorization tasks, a recent trend in authorship attribution research is to build cross-genre and cross-topic models, meaning that the training and test corpora do not share the same properties (Kestemont et al., 2012; Stamatatos, 2013; Sapkota et al., 2014; Stamatatos et al., 2015).…”
Section: Introduction · Type: mentioning · Confidence: 99%
“…We tested whether the approach works for the cross-genre authorship verification task, in the expectation that the genre markers would be limited and superficial and would therefore be among the first to be discarded in the unmasking approach, leading to a clear degradation curve indicative of same authorship. We refer to the paper [23] for a detailed description of the operationalization of the unmasking approach for our cross-genre case. We applied the approach to the theatre and prose texts of five authors.…”
Section: Cross-genre Stylometry · Type: mentioning · Confidence: 99%
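The degradation-curve logic in the statement above can be made concrete with a short sketch. The following is a minimal, illustrative implementation of the general unmasking procedure (repeatedly training a linear classifier to tell the chunks of two documents apart, then discarding the most discriminative features), not the exact setup of Koppel et al. (2007) or Kestemont et al. (2012); the chunk size, feature count, and per-round elimination rate are assumed values for illustration.

```python
# Minimal sketch of unmasking, assuming two long documents and
# illustrative hyperparameters (not the settings of the cited papers).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def chunk(text, size=500):
    """Split a document into consecutive chunks of `size` tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size])
            for i in range(0, len(tokens) - size + 1, size)]

def unmasking_curve(doc_a, doc_b, n_features=250, drop_per_round=10, n_rounds=10):
    """Return cross-validated accuracies per round as the most
    discriminative features are iteratively removed."""
    a_chunks, b_chunks = chunk(doc_a), chunk(doc_b)
    y = np.array([0] * len(a_chunks) + [1] * len(b_chunks))
    # Top-frequency words as features, as in frequent-word unmasking.
    vec = CountVectorizer(max_features=n_features)
    X = vec.fit_transform(a_chunks + b_chunks).toarray().astype(float)
    X /= X.sum(axis=1, keepdims=True)  # relative frequencies per chunk
    active = np.arange(X.shape[1])     # indices of features still in play
    curve = []
    for _ in range(n_rounds):
        clf = LinearSVC()
        # Assumes enough chunks per document for 5-fold cross-validation.
        curve.append(cross_val_score(clf, X[:, active], y, cv=5).mean())
        clf.fit(X[:, active], y)
        # Drop the features with the largest absolute SVM weights,
        # i.e. those that best separate the two documents.
        strongest = np.argsort(np.abs(clf.coef_[0]))[-drop_per_round:]
        active = np.delete(active, strongest)
    return curve
```

On a same-author pair the accuracy typically collapses within a few rounds (only superficial markers, e.g. of genre or topic, separated the texts), whereas for a different-author pair it degrades slowly; the shape of this curve is what a verification decision is then based on.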
“…In a recent study [23], we tackled both the problem of verification (rather than attribution, i.e. the open case) and the problem of cross-genre generalization.…”
Section: Cross-genre Stylometry · Type: mentioning · Confidence: 99%
“…In authorship studies, there is nowadays a general consensus that features related to style are more useful (Juola, 2006; Koppel et al., 2009; Stamatatos, 2009b), since topical, content-related features vary much more strongly across the documents authored by a single individual. Much research nowadays therefore concerns ways to effectively extract stylistic characteristics from documents that are not affected by a text's specific content or genre (Argamon & Levitan, 2005; Kestemont et al., 2012; Efstathios, 2013; Sapkota et al., 2015; Seroussi et al., 2014; Sapkota et al., 2014). This has not always been the case: historical practitioners in earlier centuries commonly based attributions on a much more loosely defined set of linguistic criteria, including, for instance, the use of conspicuous, rare words (Love, 2002; Kestemont, 2014).…”
Type: mentioning · Confidence: 99%