18th International Conference on Database and Expert Systems Applications (DEXA 2007), 2007
DOI: 10.1109/dexa.2007.5

Author Identification Using Imbalanced and Limited Training Texts

Cited by 52 publications (39 citation statements). References 8 publications. Citing publications were published between 2008 and 2021.

“…The majority of authorship attribution studies present experiments based on balanced training sets (i.e., an equal amount of training text samples for each candidate author), so it is not possible to estimate their accuracy under class imbalance conditions. Only a few studies take this factor into account (Marton et al., 2005; Stamatatos, 2007).…”
Section: CNG and Variants (mentioning)
confidence: 99%
“…Typically the features are selected based on their frequency of appearance in the profile. Examples of a profile-based approach include [10,20,11].…”
Section: Introduction (mentioning)
confidence: 99%
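The excerpt above refers to profile-based attribution, where all texts of a candidate author are concatenated and summarised by their most frequent character n-grams. Below is a minimal sketch of that idea in the spirit of the CNG (Common N-Grams) dissimilarity suggested by the section label; the n-gram length, profile size, and function names are illustrative assumptions, not the exact settings used in the cited papers.

```python
from collections import Counter

def ngram_profile(text, n=3, size=500):
    """Profile = relative frequencies of the `size` most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.most_common(size)}

def cng_dissimilarity(p1, p2):
    """CNG-style relative-difference dissimilarity between two profiles."""
    d = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        d += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return d

def attribute(test_text, training_texts, n=3, size=500):
    """Attribute test_text to the candidate whose profile is least dissimilar."""
    test_profile = ngram_profile(test_text, n, size)
    return min(
        training_texts,
        key=lambda author: cng_dissimilarity(
            ngram_profile(" ".join(training_texts[author]), n, size), test_profile
        ),
    )
```

Here `training_texts` is assumed to map each candidate author to a list of training documents; the test text is assigned to the author with the closest profile.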
“…Although the dimensionality of the problem is increased in comparison to a function word approach, it is much smaller in comparison to a word n-gram approach. Methods based on such features have produced very good results in several author identification experiments on texts in various languages [17,16,30,11]. However, there is still no consensus about an appropriate n value (the length of character n-grams) for particular natural languages and text types.…”
Section: Previous Work (mentioning)
confidence: 99%
“…Another way to represent text is by using character n-gram frequencies [17,30]. Again, the most frequent character n-grams (n contiguous characters) include the most important information.…”
Section: Previous Work (mentioning)
confidence: 99%
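As a complementary sketch of the character n-gram representation described in the excerpt above, the snippet below turns texts into vectors of relative frequencies over the most frequent n-grams of a training corpus. The choice of n = 3 and the vocabulary size are assumptions for illustration; the resulting vectors could be fed to any standard classifier and do not reproduce a specific pipeline from the cited works.

```python
from collections import Counter

def top_char_ngrams(corpus, n=3, k=1000):
    """The k most frequent character n-grams across all training texts."""
    counts = Counter()
    for text in corpus:
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in counts.most_common(k)]

def ngram_vector(text, vocabulary, n=3):
    """Relative-frequency vector of `text` over a fixed n-gram vocabulary."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1
    return [counts[g] / total for g in vocabulary]

# The vectors below could serve as input to any standard classifier (e.g., an SVM).
corpus = ["a short training text by author A", "another training text by author B"]
vocab = top_char_ngrams(corpus, n=3, k=200)
vectors = [ngram_vector(t, vocab) for t in corpus]
```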