A comparison of machine learning techniques for phishing detection

Abu-Nimeh, Saeed; Nappa, Dario; Wang, Xinlei; Nair, Suku

doi:10.1145/1299015.1299021

Cited by 334 publications

(197 citation statements)

References 10 publications

Supporting

Mentioning

178

Contrasting

Unclassified

“…Abu-Nimeh et al [9] adopted the bag-of-words strategy and used a list of words frequently found on phishing sites as features to detect phish, which is not expressive and easy to defeat by attackers. In [16], Ludl et al came up with a total of 18 properties solely based on the HTML and URL.…”

Section: Methods For Automatic Phish Detection (mentioning)

confidence: 99%

A Hierarchical Adaptive Probabilistic Approach for Zero Hour Phish Detection

Xiang, Pendleton, Hong, et al. 2010

Computer Security – ESORICS 2010

Abstract. Phishing attacks are a significant threat to users of the Internet, causing tremendous economic loss every year. In combating phish, industry relies heavily on manual verification to achieve a low false positive rate, which, however, tends to be slow in responding to the huge volume of unique phishing URLs created by toolkits. Our goal here is to combine the best aspects of human verified blacklists and heuristic-based methods, i.e., the low false positive rate of the former and the broad coverage of the latter. To that end, we present the design and evaluation of a hierarchical blacklist-enhanced phish detection framework. The key insight behind our detection algorithm is to leverage existing human-verified blacklists and apply the shingling technique, a popular near-duplicate detection algorithm used by search engines, to detect phish in a probabilistic fashion with very high accuracy. To achieve an extremely low false positive rate, we use a filtering module in our layered system, harnessing the power of search engines via information retrieval techniques to correct false positives. Comprehensive experiments over a diverse spectrum of data sources show that our method achieves 0% false positive rate (FP) with a true positive rate (TP) of 67.74% using search-oriented filtering, and 0.03% FP and 73.53% TP without the filtering module. With incremental model building capability via a sliding window mechanism, our approach is able to adapt quickly to new phishing variants, and is thus more responsive to the evolving attacks.
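
The shingling idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes word-level shingles of width `w=3` and plain Jaccard resemblance, and the function names `shingles` and `resemblance` are hypothetical. The abstract does not state the shingle width or any decision threshold.

```python
def shingles(text, w=3):
    """Return the set of contiguous w-word shingles in text (w=3 is an assumed default)."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard similarity between the shingle sets of two texts, in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)
```

A page whose resemblance to a known blacklisted page exceeds some chosen threshold would be flagged as a near-duplicate; the threshold itself is an assumption left to the reader.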

“…PFILTER, which was proposed by Fette et al [8], employed SVM to distinguish phishing emails from other emails. According to [9], Abu-Nimeh et al compared the predictive accuracy of several machine learning methods including LR, CART, RF, NB, SVM, and BART. They analyzed 1,117 phishing emails and 1,718 legitimate emails with 43 features for distinguishing phishing emails.…”

Section: Related Work (mentioning)

confidence: 99%

“…They analyzed 973 phishing emails and 3,027 legitimate emails with 12 features, and showed that the lowest error rate was 2.01%. The experimental conditions were different between [9] and [10], however, the machine learning provided high accuracy for the detection of phishing emails.…”

Section: Related Work (mentioning)

confidence: 99%

An Evaluation of Machine Learning-Based Methods for Detection of Phishing Sites

Miyamoto, Hazeyama, Kadobayashi 2009

Advances in Neuro-Information Processing

Abstract. In this paper, we evaluate the performance of machine learning-based methods for detection of phishing sites. In our previous work [1], we attempted to employ a machine learning technique to improve the detection accuracy. Our preliminary evaluation showed the AdaBoost-based detection method can achieve higher detection accuracy than the traditional detection method. Here, we evaluate the performance of 9 machine learning techniques including AdaBoost, Bagging, Support Vector Machines, Classification and Regression Trees, Logistic Regression, Random Forests, Neural Networks, Naive Bayes, and Bayesian Additive Regression Trees. We let these machine learning techniques combine heuristics, and also let machine learning-based detection methods distinguish phishing sites from others. We analyze our dataset, which is composed of 1,500 phishing sites and 1,500 legitimate sites, classify them using the machine learning-based detection methods, and measure the performance. In our evaluation, we used f1 measure, error rate, and Area Under the ROC Curve (AUC) as performance metrics along with our requirements for detection methods. The highest f1 measure is 0.8581, the lowest error rate is 14.15%, and the highest AUC is 0.9342, all of which are observed in the case of AdaBoost. We also observe that 7 out of 9 machine learning-based detection methods outperform the traditional detection method.
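
For readers unfamiliar with two of the metrics named in this abstract, the following is a minimal sketch of how the f1 measure and error rate are conventionally computed from confusion-matrix counts. The function name `f1_and_error` is hypothetical and the counts in the usage example are made up for illustration; they are not taken from the paper. AUC is omitted because it requires ranked classifier scores rather than counts.

```python
def f1_and_error(tp, fp, tn, fn):
    """Compute (f1, error_rate) from confusion-matrix counts.

    tp/fn: phishing items classified correctly/incorrectly
    tn/fp: legitimate items classified correctly/incorrectly
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # f1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Error rate is the fraction of all items that were misclassified.
    error_rate = (fp + fn) / total
    return f1, error_rate


# Illustrative counts only (hypothetical, not from the cited evaluation).
f1, err = f1_and_error(tp=8, fp=2, tn=8, fn=2)
```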

“…In spite of this challenge, classifiers have been shown to achieve good precision in identifying phishing messages, over collections containing typical phishing messages [6,3,1], using features which are often unnoticed by (human) victims, e.g. hyperlinks to suspect websites in the email.…”

Section: Design Of An Integrated Email Filtering System (mentioning)

confidence: 99%