In this paper we present an instance-based reasoning e-mail filtering model that outperforms classical machine learning techniques and other successful lazy learning approaches in the domain of anti-spam filtering. The architecture of the learning-based anti-spam filter is based on a tuneable enhanced instance retrieval network able to accurately generalize e-mail representations. Similar messages are reused through a simple unanimous voting mechanism that determines whether the target case is spam or not. Before the system issues its final response, a revision stage applies general knowledge in the form of metarules; this stage is performed only when the assigned class is spam.
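The reuse and revision steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the retrieved messages are represented or how metarules act, so the sketch assumes that the target is labelled spam only when every retrieved neighbour is spam, and that a metarule is a hypothetical predicate that can veto a spam verdict. All names (`unanimous_vote`, `classify`, `from_known_sender`) are illustrative.

```python
from typing import Callable, Sequence

SPAM = "spam"
LEGITIMATE = "legitimate"

# Hypothetical metarule type: returns True when the spam verdict should be vetoed.
Metarule = Callable[[str], bool]


def unanimous_vote(neighbour_labels: Sequence[str]) -> str:
    """Return SPAM only if at least one neighbour exists and all are spam.

    Assumption: any disagreement (or no neighbours) falls back to LEGITIMATE.
    """
    if neighbour_labels and all(label == SPAM for label in neighbour_labels):
        return SPAM
    return LEGITIMATE


def classify(message: str,
             neighbour_labels: Sequence[str],
             metarules: Sequence[Metarule]) -> str:
    """Reuse step (voting) followed by a revision step run only for spam."""
    label = unanimous_vote(neighbour_labels)
    if label == SPAM and any(rule(message) for rule in metarules):
        # Revision: a metarule vetoed the spam verdict (assumed behaviour).
        return LEGITIMATE
    return label


def from_known_sender(message: str) -> bool:
    """Illustrative metarule: trust messages carrying a known-sender marker."""
    return "[known-sender]" in message


if __name__ == "__main__":
    print(classify("cheap pills", [SPAM, SPAM, SPAM], [from_known_sender]))
    print(classify("[known-sender] invoice", [SPAM, SPAM], [from_known_sender]))
    print(classify("meeting notes", [SPAM, LEGITIMATE], [from_known_sender]))
```

Requiring unanimity before assigning the spam class biases the sketch toward avoiding false positives, which matches running the revision stage only on spam verdicts.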
A great number of machine learning techniques have been applied to problems where data is collected over an extended period of time. However, in many real-world applications the distribution underlying the data is likely to change over time. In these situations, many global eager learners are unable to adapt to local concept drift. Concept drift in spam is particularly difficult to handle because spammers actively change the nature of their messages to elude spam filters. Algorithms that track concept drift must be able to identify a change in the target concept (spam or legitimate e-mails) without direct knowledge of the underlying shift in distribution. In this paper we show how a previously successful instance-based reasoning e-mail filtering model can be improved to better track concept drift in the spam domain. Our proposal is based on the definition of two complementary techniques able to select both terms and e-mails representative of the current situation. The enhanced system is evaluated against other well-known successful lazy learning approaches in two scenarios, all within a cost-sensitive framework. The experimental results are very promising and support the idea that instance-based reasoning systems offer a number of advantages in tackling concept drift in dynamic problems such as anti-spam filtering.
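The abstract does not state which cost-sensitive measures are used. As an illustration only, the following Python sketch computes two measures that are common in the anti-spam literature, weighted accuracy and total cost ratio (TCR), under the assumption that blocking a legitimate e-mail is `cost` times worse than letting a spam message through. The `Confusion` container, the function names and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from one evaluation run (illustrative container)."""
    true_spam: int    # spam correctly classified as spam
    false_spam: int   # legitimate mail wrongly classified as spam
    true_legit: int   # legitimate mail correctly classified
    false_legit: int  # spam wrongly classified as legitimate


def weighted_accuracy(c: Confusion, cost: float) -> float:
    """Accuracy in which each legitimate e-mail counts as `cost` messages."""
    n_legit = c.true_legit + c.false_spam
    n_spam = c.true_spam + c.false_legit
    return (cost * c.true_legit + c.true_spam) / (cost * n_legit + n_spam)


def total_cost_ratio(c: Confusion, cost: float) -> float:
    """Cost of using no filter divided by the cost of using this filter.

    Values above 1 mean the filter is better than no filter at all.
    """
    n_spam = c.true_spam + c.false_legit
    weighted_errors = cost * c.false_spam + c.false_legit
    if weighted_errors == 0:
        return float("inf")  # a perfect filter has no weighted errors
    return n_spam / weighted_errors


if __name__ == "__main__":
    run = Confusion(true_spam=90, false_spam=2, true_legit=98, false_legit=10)
    for cost in (1, 9):
        print(cost, weighted_accuracy(run, cost), total_cost_ratio(run, cost))
```

Raising `cost` penalizes the two blocked legitimate e-mails more heavily, so the same filter scores a lower TCR in the stricter scenario.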
This paper introduces Wirebrush4SPAM, a plug-in-based C framework specifically designed for the development of fast spam filters by assembling different antispam schemes and techniques. Wirebrush4SPAM can be used to (i) build, execute and deploy simple spam filters and (ii) develop new techniques that can be easily combined and tested to achieve more accurate antispam models. To construct custom filters, programmers should manage three key concepts: filtering functions, parsers and event listeners. The main features of Wirebrush4SPAM include (i) a plug-in-based design, (ii) cache support for developing new plugins, (iii) a smart filter evaluation heuristic for improving filter execution, (iv) configurable rule scheduling and (v) support for domain-specific rules. Moreover, Wirebrush4SPAM is 10 times faster than SpamAssassin, the most popular and highly extensible framework for spam filtering. Wirebrush4SPAM is an open-source project licensed under the terms of the GNU Lesser General Public License, and both source code and documentation are publicly available at http://www.wb4spam.org/.
The rest of the paper is structured as follows: Section 2 introduces the current status of the SpamAssassin framework. Section 3 describes the Wirebrush4SPAM project, highlighting the main differences and improvements with respect to SpamAssassin. Section 4 shows how to build Wirebrush4SPAM filters, while Section 5 presents and discusses the results of an empirical efficiency comparison between Wirebrush4SPAM and SpamAssassin. Finally, Section 6 summarizes the main conclusions extracted from this work and outlines future research lines.
THE SPAMASSASSIN FRAMEWORK
SpamAssassin is currently the most popular and highly extensible framework for spam filtering. The whole project has been developed following an object-oriented design and includes a ... During our perceptive analysis of SpamAssassin, we detected some weaknesses that reduce filter efficiency. These include the use of an interpreted programming language, the lack of appropriate cache structures, the absence of rule execution schemes to improve efficiency and several deficiencies in the multithreading scheme. Moreover, some companies reported the need ...