On Authorship Attribution via Markov Chains and Sequence Kernels

Sanderson, Conrad; Guenter, Simon

doi:10.1109/icpr.2006.899

Cited by 11 publications

(5 citation statements)

References 11 publications

“…The idea was motivated by previous works using n-gram models to discriminate text into categories according to genre [32], [33], [34], [35], authorship [32], [36], sentiment [37], [38], language [39], etc. We believe that such discriminative capability can be also exhibited by the TD and TO model components.…”

Section: B Text Classification (mentioning)

confidence: 99%

Decoupling Word-Pair Distance and Co-occurrence Information for Effective Long History Context Language Modeling

Chong

Banchs

Chng

et al. 2015

IEEE/ACM Trans. Audio Speech Lang. Process.

In this paper, we propose the use of distance and co-occurrence information of word-pairs to improve language modeling. We have empirically shown that, for history-context sizes of up to ten words, the extracted information about distance and co-occurrence complements the n-gram language model well, for which learning long-history contexts is inherently difficult. Evaluated on the Wall Street Journal and the Switchboard corpora, our proposed model reduces the trigram model perplexity by up to 11.2% and 6.5%, respectively. As compared to the distant bigram model and the trigger model, our proposed model offers a more effective manner of capturing far context information, as verified in terms of perplexity and computational efficiency, i.e., fewer free parameters to be fine-tuned. Experiments using the proposed model for speech recognition, text classification and word prediction tasks showed improved performance.

“…If we consider simplicity and language independence as primary factors, lexical features are expected to perform better than other features. Especially, the character n-gram representation has been used as one of the most effective measures of authorship attribution [13, 15]. If authors tend to use similar patterns in their writings, this would imply that syntactic and semantic features may lead to superior results.…”

Section: Related Work (mentioning)

confidence: 99%

Chat biometrics

Kuzu

Salah

2018

IET biom.

On-line social platforms implement moderation mechanisms to filter out unwanted content and to take action against possible cases of verbal aggression and abuse, sexual harassment, and such. In this study, the authors investigate chat biometrics, the identification of users from their verbal behaviour on a social platform. The typical application scenarios are the re-identification of banned users, returning under different identities, and aggressors operating through multiple fake accounts. They propose a novel processing pipeline, and contrast the problem with the authorship recognition problem, which is relatively well-studied in the literature. They evaluate the proposed approach on a large corpus of multiparty chat records in Turkish, which they have previously collected from a multiplayer game environment. They also introduce a new corpus in this study, collected from a well-known Turkish social platform called Ekşisözlük, in order to test the robustness of the system across domain changes, as well as on Portuguese and English news datasets to test it on different languages. They evaluate both instance-based and profile-based approaches, and provide detailed analyses with regard to the required amount of text to identify a person reliably.

“…Let S_t denote the event that a section s ∈ {s_1, …, s_n} belongs to the target group (= not plagiarized); likewise, let S_o denote the event that s belongs to the outlier

Character n-gram frequency/ratio*: Kjell et al (1994), Sanderson and Guenter (2006a), Juola (2006) and Koppel (2009)
Average sentence length: Holmes (1998) and Zheng et al (2006)
Average number of syllables per word*: Holmes (1998)
Word frequency: Mosteller and Wallace (1964), Holmes (1998) and Koppel (2009)
Word n-grams frequency/ratio: Sanderson and Guenter (2006a)
Number of hapax legomena: Tweedie and Baayen (1998) and Zheng et al…”

Section: Outlier Identification (mentioning)

confidence: 99%

Intrinsic plagiarism analysis

Stein

Lipka

Prettenhofer

2010

Lang Resources & Evaluation

Research in automatic text plagiarism detection focuses on algorithms that compare suspicious documents against a collection of reference documents. Recent approaches perform well in identifying copied or modified foreign sections, but they assume a closed world where a reference collection is given. This article investigates the question whether plagiarism can be detected by a computer program if no reference can be provided, e.g., if the foreign sections stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism analysis; it is closely related to the problem of authorship verification. Our contributions are threefold. (1) We organize the algorithmic building blocks for intrinsic plagiarism analysis and authorship verification and survey the state of the art. (2) We show how the meta learning approach of Koppel and Schler, termed "unmasking", can be employed to post-process unreliable stylometric analysis results. (3) We operationalize and evaluate an analysis chain that combines document chunking, style model computation, one-class classification, and meta learning.

Problem statement

In the following, the term plagiarism refers to text plagiarism, i.e., the use of another author's information, language, or writing, when done without proper acknowledgment of the original source. Plagiarism detection refers to the unveiling of text plagiarism. Existing approaches to computer-based plagiarism detection break down this task into manageable parts: "Given a text d and a reference collection D, does d contain a section s for which one can find a document d_i ∈ D that contains a section s_i such that under some retrieval model R the similarity φ_R between s and s_i is above a threshold θ?" Observe that research on automated plagiarism detection presumes a closed world where a reference collection D is given. Since D can be extremely large (possibly the entire indexed part of the World Wide Web), the main research focus is on efficient search technology: near-similarity search and near-duplicate detection (Brin et al
