Authorship detection of SMS messages using unigrams

Ragel, Roshan; Herath, P. M. G. D. M.; Senanayake, Upul

doi:10.1109/iciinfs.2013.6732015

Cited by 18 publications

(15 citation statements)

References 15 publications


“…A large volume of literature is available on author identification for long documents and little literature exists on using author identification for short texts and focused upon different messaging systems. Most of the techniques used in this area have focused only on identifying users' stylometry in individual messaging systems [12, 16-21]. Some researchers in this field have focused only on using the relationship with the same user's stylometry linked to different messaging systems through a technique known as "linkability" [22]; for example, linking the user's stylometry based on a user profile.…”

Section: Author Identification (mentioning)

confidence: 99%


Surveying the Development of Authorship Identification of Text Messages

Altamimi, Alotaibi, Alruban

2019

IJICR


People typically use multiple messaging systems concurrently, sending text messages through SMS, email, Twitter, and Facebook. These messaging systems are therefore highly integrated into people's everyday activities. Forensic science, meanwhile, faces challenges across many kinds of crime, ranging from physical crimes to computer-mediated criminal activity. Identifying suspects and establishing the ownership of messages has become crucial for law enforcement, due in particular to the anonymity that the internet and associated services provide. This paper surveys the development of author identification through a comprehensive review of the field, in order to determine the core approaches and techniques used. The study also examines possible challenges in author identification and points to some of the open problems that need to be tackled.



“…Ragel et al [21] focused specifically on authorship detection of SMS to identify authorship using unigrams as features. They stated that the length of the SMS is limited to 140 characters.…”

Section: Stylometric Features On Short Text (mentioning)

confidence: 99%

Surveying the Development of Authorship Identification of Text Messages

Altamimi, Alotaibi, Alruban

2019

IJICR


“…The researchers applied likelihood ratio for authorship analysis using N-gram approach (Ishihara 2011, 2014), vocabulary richness and lexical features (Ishihara 2012), and got the best result of their system in terms of log-likelihood-ratio cost that was 0.46. Ragel, Herath and Senanayake (2013) also utilized NUS SMS corpus for identifying the best experimental conditions for authorship detection/identification. For this purpose, they created a profile of each author treating it as a known author and at the same time creating a similar profile from testing data and treating it as unknown.…”

Section: Related Work (mentioning)

confidence: 99%
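
The profile-matching idea described in the statement above (one unigram profile per known author, compared against a profile built from the unknown messages) can be sketched roughly as follows. This is an illustration only, not the cited authors' implementation: the whitespace tokenization, the relative-frequency weighting, the use of cosine similarity, and the names `unigram_profile`, `cosine_similarity`, and `attribute` are all assumptions made for this sketch.

```python
from collections import Counter
import math
import re


def unigram_profile(messages):
    """Build a relative-frequency unigram profile from a list of messages.

    Assumption: tokens are lowercased, whitespace-delimited strings, so
    SMS shorthand such as "l8r" or "u" is kept as-is.
    """
    counts = Counter()
    for message in messages:
        counts.update(re.findall(r"\S+", message.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse profiles (dicts token -> weight).

    Assumption: cosine similarity is used here for illustration; the
    statement above does not name the comparison measure.
    """
    dot = sum(weight * q.get(token, 0.0) for token, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def attribute(known_profiles, unknown_messages):
    """Return the known author whose profile is closest to the unknown one."""
    unknown = unigram_profile(unknown_messages)
    return max(
        known_profiles,
        key=lambda author: cosine_similarity(known_profiles[author], unknown),
    )
```

In use, one would build `known_profiles` as a dict mapping each candidate author to `unigram_profile(their_training_messages)` and then call `attribute(known_profiles, test_messages)`. Pooling several test messages into one unknown profile gives the comparison more tokens to work with than a single short SMS.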

Multilingual SMS-based author profiling: Data and methods

Fatima, Anwar, Naveed et al.

2018

Nat. Lang. Eng.


In recent years, many benchmark author profiling corpora have been developed for various genres, including Twitter, social media, blogs, hotel reviews, and e-mail. However, no such standard evaluation resource has been developed for Short Messaging Service (SMS), a popular medium of communication that is very useful for author profiling. The primary aim of this study is to develop a large multilingual (English and Roman Urdu) benchmark SMS-based author profiling corpus. The proposed corpus contains 810 author profiles, wherein each profile consists of an aggregation of SMS messages as a single document of an author, along with seven demographic traits associated with each author profile: gender, age, native language, native city, qualification, occupation and personality type (introvert/extrovert). The secondary aims of this study include the following: (1) annotating the proposed corpus for code-switching at the lexical level (approximately 0.69 million tokens are manually annotated for code-switching) and (2) applying the stylometry-based method (groups of sixty-four features) and the content-based method (twelve features) for gender identification in order to demonstrate how our proposed corpus can be used for the development and evaluation of various author profiling methods. The results show that the content-based character 5-gram feature outperformed all the other features, obtaining an accuracy score of 0.975 and an F1 score of 0.947 for gender identification while using the entire corpus. Furthermore, our proposed corpora (SMS–AP–18 and code-switched SMS–AP–18) are freely and publicly available for research purposes.


“…As a scientific endeavour it dates back at least to the 19th century [104,112], and was formulated as a computational task in the 1960s [118,163]. In contemporary work, the traditional focus on literary documents has largely been overshadowed by the increased use of online datasets, such as blog posts [121], e-mails [32,37], forum discussions [183], SMS messages [138], and tweets [25]. Neal et al [123] comprehensively survey the state-of-the-art in stylometry.…”

Section: Introduction (mentioning)

confidence: 99%

Text Analysis in Adversarial Settings: Does Deception Leave a Stylistic Trace?

Gröndahl, Asokan

2019

Preprint


Textual deception constitutes a major problem for online security. Many studies have argued that deceptiveness leaves traces in writing style, which could be detected using text classification techniques. By conducting an extensive literature review of existing empirical work, we demonstrate that while certain linguistic features have been indicative of deception in certain corpora, they fail to generalize across divergent semantic domains. We suggest that deceptiveness as such leaves no content-invariant stylistic trace, and textual similarity measures provide superior means of classifying texts as potentially deceptive. Additionally, we discuss forms of deception beyond semantic content, focusing on hiding author identity by writing style obfuscation. Surveying the literature on both author identification and obfuscation techniques, we conclude that current style transformation methods fail to achieve reliable obfuscation while simultaneously ensuring semantic faithfulness to the original text. We propose that future work in style transformation should pay particular attention to disallowing semantically drastic changes.
