Information extraction from spam emails using stylistic and semantic features to identify spammers

Halder, Soma; Tiwari, Rajive; Sprague, Alan P.

doi:10.1109/iri.2011.6009529

Cited by 11 publications

(15 citation statements)

References 3 publications

(4 reference statements)

“…First, the Trfr(Term frequency) and Indfr(Inverse Document frequency) for the top n most frequent words used in the dataset and second, the count of the top n bigrams used in the dataset, where n is the number that is decided based upon the cutoff of the minimum frequency count. Trfr & Indfr is a statistical measure that can be used to represent the importance of a term in a document [1]. It first remove all the stop words from the emails and then Trfr-Indfr can be calculated.…”

Section: Semantic Parameters (mentioning)

confidence: 99%
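
The term-weighting step described in the statement above can be illustrated with a minimal sketch. It assumes the standard TF-IDF definition (term frequency times the log of inverse document frequency), which may differ in detail from the formula used in the cited papers. The stop-word list, the function names, and the `top_n` parameter are illustrative assumptions, not taken from the source.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "for", "you", "your"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tf_idf(docs, top_n=5):
    """Return one {term: weight} dict per document for the top_n most frequent corpus terms."""
    tokenized = [tokenize(d) for d in docs]
    corpus_counts = Counter(w for toks in tokenized for w in toks)
    vocab = [w for w, _ in corpus_counts.most_common(top_n)]
    n_docs = len(docs)
    # Document frequency: number of documents that contain the term at least once.
    doc_freq = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty emails
        vectors.append(
            {w: (counts[w] / total) * math.log(n_docs / doc_freq[w]) for w in vocab}
        )
    return vectors


def top_bigrams(docs, top_n=5):
    """Count adjacent token pairs across all documents and return the top_n."""
    counts = Counter()
    for d in docs:
        toks = tokenize(d)
        counts.update(zip(toks, toks[1:]))
    return counts.most_common(top_n)
```

A term that occurs in every email receives a weight of zero under this definition, which is why very common words contribute little to the resulting feature vectors.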

“…It is required to measure cluster quality. The purity percentage is evaluated by following equation [1]:…”

Section: Purity (mentioning)

confidence: 99%
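
The equation itself is not reproduced in the statement above. As a hedged illustration only, the sketch below uses the commonly cited definition of cluster purity (the share of items in each cluster that belong to that cluster's majority class, weighted by cluster size). Whether the cited paper uses exactly this form is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def cluster_purity(labels):
    """Purity of one cluster: fraction of members in its majority class."""
    if not labels:
        return 0.0
    return Counter(labels).most_common(1)[0][1] / len(labels)


def overall_purity(clusters):
    """Size-weighted purity over all clusters, expressed as a percentage."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    # Sum the majority-class counts of every non-empty cluster.
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return 100.0 * majority / total
```

Here each cluster is given as a list of ground-truth labels (for example, the spam campaign or domain each email is known to belong to), so a cluster whose emails all share one label has purity 1.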

“…During analysis [1], system extracts some features from spam emails and clusters them according to their parameter similarity. From these clusters spam domains are identified.…”

Section: Introduction (mentioning)

confidence: 99%

“…According to Halder et al [1], analyzed spam emails have similarity in styles and semantics of them. They proposed that spammer can be recognized by clustering identical spam emails according to stylistic, semantic and combined features.…”

Section: Introduction (mentioning)

confidence: 99%

Recognizing Spam Domains by Extracting Features from Spam Emails using Data Mining

Patel

2014

IJCA

This paper attempts to develop an algorithm to recognize spam domains using data mining techniques, with a focus on law enforcement forensic analysis. Spam filtering has been the major weapon against spam, but it has failed to reduce the number of spam emails sent to an indiscriminate set of recipients. The proposed algorithm accepts as input the spam mails of a personal account and extracts features such as stylistic and semantic parameters, related email subjects, and the URLs present in the emails. The individual features are then clustered and evaluated. Further, these clusters are mapped to their respective domains. These spam domains are the URLs of the webpages that the spammer is trying to promote. The WHOIS information of a domain helps to identify the source of that domain. Parameters such as overall purity and the number of emails present in the cluster with the highest purity are used to measure the results for the individual features. Experimental results show that clustering spam mails by stylistic and semantic parameters is 20% less pure than clustering by the other two features of spam mails.
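
The mapping from clusters to promoted domains described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the URL pattern, the function names, and the idea of passing in precomputed cluster ids are all hypothetical, and the WHOIS lookup step is omitted.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Simple pattern for http(s) URLs in plain-text email bodies (an assumption;
# real spam often obfuscates links and needs more careful parsing).
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_domains(email_body):
    """Return the set of host names found in the URLs of one email body."""
    domains = set()
    for url in URL_RE.findall(email_body):
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # skip malformed URLs
        if host:
            domains.add(host.lower())
    return domains


def domains_per_cluster(cluster_ids, email_bodies):
    """Group the extracted domains by the cluster id assigned to each email."""
    mapping = defaultdict(set)
    for cid, body in zip(cluster_ids, email_bodies):
        mapping[cid] |= extract_domains(body)
    return dict(mapping)
```

A cluster that maps to a single domain suggests one promoted site, while a cluster spread over many unrelated domains is a sign that the clustering feature did not separate campaigns cleanly.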

“…Various approaches to feature extraction have been researched, with mixed results. Recent developments suggest that both semantic and statistical features can be used to cluster email text [10], and this technique is explored further in this paper.…”

Section: Definitions and Problem Statement (mentioning)

confidence: 99%

Semantics based multi-layered networks for spam email detection

Creech

Jiang

2012

AIP Conference Proceedings

Abstract. In this paper, we propose a hybrid semantic and statistics approach for spam email detection. An adaptive training scheme is implemented, which does not require a large training pool. Experimental results have shown promising performance.

Introduction. Unsolicited e-mail has become the bane of modern life. The impact of spam emails on business, government, and general activity ranges from irritating to severely deleterious. Spam has two main effects: first, it clogs the Internet, reducing effective throughput and slowing the transmission of genuine communications; second, there is significant aggregate time wasted in deleting such e-mails. Whilst it may only take a few seconds for one individual to remove one spam e-mail, the total time squandered when this same e-mail is deleted by millions of recipients becomes significant, with a consequent reduction in productivity and an increasing drain on IT resources. In an attempt to combat the effect of spam, numerous email filters are currently deployed on most operating systems and ISPs. Tuning these spam filters is a complex task, as a false positive, and the consequent blocking of a bona fide e-mail, has a much greater impact on the individual user than a false negative and the subsequent transmission of a spam e-mail. End users are hence reluctant to use overly aggressive spam filters, regardless of the overall impact on Internet traffic. Significant gains in global efficiency could be provided by a highly accurate spam filter.

This paper introduces a hybrid approach to the spam problem, utilising a combined semantic and statistical text-based feature extraction methodology coupled with three multilayer feedforward backpropagation neural networks operating in a combined voting architecture. A level of self-training was implemented using high-confidence unanimous decisions to further develop the network beyond its initial training set. Significantly, good accuracy was achieved on previously unseen e-mail test cases using only 11 features and an initial training set of only 200 elements. The adaptive nature of this system, coupled with the small initial training set, suggests that a full implementation would perform well in dynamic environments, with the neural network topology lending itself to possible hardware implementations as explored in [1,2]. The voting methodology adopted by this algorithm significantly mitigates the training overhead traditionally required by neural networks, as each network in this approach is only required to find a local minimum, rather than converging to the elusive global result of a perfect system. Errors arising in any one neural network are generally accounted for by correct classifications from the other two networks, resulting in an aggregate system with a high accuracy rate, low initial training requirements, and a self-reinforcing adaptive learning capability. This paper is organised as follows: in Curr...
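
The voting and self-training idea in this introduction can be sketched in a few lines. The sketch is an assumption-laden illustration: the classifiers are passed in as plain callables standing in for the three trained neural networks, only unanimity is checked (the "high confidence" criterion is not specified in the excerpt), and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def classify_and_collect(classifiers, samples):
    """Label each sample by majority vote and collect unanimous decisions.

    Returns (labels, confident), where confident holds (sample, label) pairs
    on which every classifier agreed. These pairs are the candidates that a
    self-training loop could add to the training set.
    """
    labels, confident = [], []
    for x in samples:
        votes = [clf(x) for clf in classifiers]
        label = majority_vote(votes)
        labels.append(label)
        if len(set(votes)) == 1:  # unanimous decision
            confident.append((x, label))
    return labels, confident
```

With three voters and two classes, a single network's error is outvoted whenever the other two are correct, which matches the error-compensation argument made in the text.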

What Leads to Effective Online Physician-Patient Communication? the Power of Convergence

Wang

Zhang

Meng

2023

Lecture Notes in Business Information Processing
