An Assessment of Case-Based Reasoning for Spam Filtering

Delany, Sarah Jane; Cunningham, Pádraig; Coyle, Lorcan

doi:10.1007/s10462-005-9006-6

Cited by 57 publications

(45 citation statements)

References 9 publications

(12 reference statements)

“…Weighted nearest neighbor weights each of the k nearest examples by its similarity to x and compares the sum of the weights to a threshold. Several authors consider the use of clustering and kNN methods for spam filtering, but none report strong performance [6,47,143,144,197].…”

Section: Nearest Neighbor Methods

mentioning

confidence: 99%
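
The weighted nearest neighbor rule quoted above can be illustrated with a minimal Python sketch. The cosine similarity, the signed labels (+1 for spam, -1 for ham), the function names and the default threshold are all assumptions made for illustration; the quoted review does not specify them.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts (assumed similarity measure)."""
    dot = sum(v * b.get(term, 0) for term, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def weighted_knn_is_spam(x, examples, k=3, threshold=0.0):
    """Classify x from labelled examples.

    examples: list of (features, label) pairs, label +1 for spam, -1 for ham.
    Each of the k most similar examples votes with a weight equal to its
    similarity to x; the signed sum is compared to the threshold.
    """
    neighbours = sorted(examples, key=lambda e: cosine(x, e[0]), reverse=True)[:k]
    score = sum(label * cosine(x, feats) for feats, label in neighbours)
    return score > threshold, score

# Hypothetical toy case base
examples = [
    ({"viagra": 1, "cheap": 1}, +1),
    ({"cheap": 1, "offer": 1}, +1),
    ({"meeting": 1, "agenda": 1}, -1),
    ({"lunch": 1, "meeting": 1}, -1),
]
is_spam, score = weighted_knn_is_spam({"cheap": 1, "viagra": 1, "offer": 1}, examples)
```

Raising the threshold in this sketch trades missed spam for fewer false positives, which is usually the costlier error in spam filtering.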

Email Spam Filtering: A Systematic Review

Cormack

2008

FNT in Information Retrieval

201

141

Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet current spam filters are remarkably effective: more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam?

We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; similarities and differences with spam filtering in other communication and storage media (such as instant messaging and the Web) are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results and their limitations.

In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.

“…We do not use numeric-valued features (e.g. occurrence frequencies) because we found that they resulted in only minor improvements in overall accuracy, no significant decrease in false positives, and much increased classification and case base editing times [7].…”

Section: The Feature-based Distance Measure

mentioning

confidence: 99%

“…We found it better, especially from the point of view of false positives, not to use feature weighting on the binary representation [7]. We compute features from some of the header fields and the body of the emails, with no stop-word removal or stemming.…”

Section: The Feature-based Distance Measure

mentioning

confidence: 99%
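
The binary representation described in the two statements above (presence or absence of a token, no occurrence frequencies, no feature weighting, no stop-word removal or stemming) can be sketched in Python as follows. The tokenizer, the restriction to a subject line plus body, and the use of Jaccard distance are assumptions for illustration only; the quoted text does not give the citing paper's actual distance measure.

```python
import re

def binary_features(subject, body):
    """Return the set of lowercased tokens present in the subject and body.

    Presence only: repeated tokens count once, and no stop-word removal
    or stemming is applied, mirroring the quoted description.
    """
    return set(re.findall(r"[a-z0-9]+", (subject + " " + body).lower()))

def jaccard_distance(a, b):
    """Jaccard distance between two binary feature sets (an assumed, swapped-in measure)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

# Hypothetical messages
f_spam = binary_features("Cheap offer", "the cheap cheap offer")
f_ham = binary_features("Meeting", "the agenda")
distance = jaccard_distance(f_spam, f_ham)
```

Because the stop word "the" is retained, it contributes to the overlap between the two messages in this sketch.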

“…In content-based filters, the classifier may make its decisions using, for example, rules, decision trees or boosted trees [2], Support Vector Machines [3], probabilities [4,5] or exemplars [5][6][7][8]. Except in the case of rules, which are most often human-authored, a learning algorithm usually induces the classifier from a set of labelled training examples.…”

mentioning

confidence: 99%

“…We have been taking a Case-Based Reasoning (CBR) approach to spam filtering [6,7,10], as have others [8], in the belief that this approach can overcome the challenges. First, individual users can maintain their own case bases to represent their personal, subjective interests.…”

mentioning

confidence: 99%

Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering

Delany

Bridge

Case-Based Reasoning Research and Development

Self Cite

Abstract. In this paper, we compare case-based spam filters, focusing on their resilience to concept drift. In particular, we evaluate how to track concept drift using a case-based spam filter that uses a feature-free distance measure based on text compression. In our experiments, we compare two ways to normalise such a distance measure, finding that the one proposed in [1] performs better. We show that a policy as simple as retaining misclassified examples has a hugely beneficial effect on handling concept drift in spam but, on its own, it results in the case base growing by over 30%. We then compare two different retention policies and two different forgetting policies (one a form of instance selection, the other a form of instance weighting) and find that they perform roughly as well as each other while keeping the case base size constant. Finally, we compare a feature-based textual case-based spam filter with our feature-free approach. In the face of concept drift, the feature-based approach requires the case base to be rebuilt periodically so that we can select a new feature set that better predicts the target concept. We find feature-free approaches to have lower error rates than their feature-based equivalents.
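
As a rough illustration of a feature-free, compression-based distance, the Python sketch below computes the widely used Normalized Compression Distance with zlib. This is an assumption for illustration: the abstract does not say which two normalisations were compared, which compressor was used, or which normalisation is the one from [1].

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical messages: similar texts compress well together, so their distance is small
a = "cheap pills buy now limited offer click here " * 5
b = "minutes of the project meeting are attached please review before friday " * 5
same = ncd(a, a)
different = ncd(a, b)
```

No feature set is selected here, so nothing has to be rebuilt when the vocabulary of spam changes, which is the property the abstract contrasts with the feature-based approach.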

Generating Estimates of Classification Confidence for a Case-Based Spam Filter

Delany

Cunningham

Doyle

et al. 2005

Case-Based Reasoning Research and Development

Self Cite

Producing estimates of classification confidence is surprisingly difficult. One might expect that classifiers that can produce numeric classification scores (e.g. k-Nearest Neighbour or Naive Bayes) could readily produce confidence estimates based on thresholds. In fact, this proves not to be the case, probably because these are not probabilistic classifiers in the strict sense. The numeric scores coming from k-Nearest Neighbour or Naive Bayes classifiers are not well correlated with classification confidence. In this paper we describe a case-based spam filtering application that would benefit significantly from an ability to attach confidence predictions to positive classifications (i.e. messages classified as spam). We show that 'obvious' confidence metrics for a case-based classifier are not effective. We propose an ensemble-like solution that aggregates a collection of confidence metrics and show that this offers an effective solution in this spam filtering domain.
