Feature modeling and cluster analysis of malicious Web traffic

by Ana Dimitrijevikj

Many attackers find Web applications to be attractive targets since they are widely used and have many vulnerabilities to exploit. The goal of this thesis is to study patterns of attacker activities on typical Web-based systems using four data sets collected by honeypots, each spanning almost four months. The contributions of our work include cluster analysis and modeling of the features of the malicious Web traffic. Some of our main conclusions are: (1) Features of malicious sessions, such as Number of Requests, Bytes Transferred, and Duration, follow skewed distributions, including heavy-tailed ones. (2) The number of requests per unique attacker follows skewed distributions, including heavy-tailed ones, with a small number of attackers submitting most of the malicious traffic. (3) Cluster analysis provides an efficient way to distinguish between attack sessions and vulnerability scan sessions.

First, I would like to thank my committee chair and advisor, Dr. Katerina Goseva-Popstojanova, for her guidance, support, and encouragement throughout my graduate studies. I would also like to thank Dr. James Mooney and Dr. Arun Ross for serving on my graduate committee. I am grateful to all my committee members for their support, advice, and collaboration. I would like to acknowledge that my work has been funded by the National Science Foundation under CAREER grant CNS-0447715. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation. I also want to thank and acknowledge Risto Pantev, Brandon S. Miller, J. Alex Baker, Jonathan Lynch, and David Krovich for their collaboration in the research project. I want to thank all my friends for their help and support.
Finally, I would like to express my deepest gratitude to my mother, father, and brother. They all motivated and encouraged me to pursue this degree. They were always supporting me and that means the world to me.
The numbers of vulnerabilities and reported attacks on Web systems are increasing, which clearly illustrates the need for a better understanding of malicious cyber activities. In this paper we use clustering to classify attacker activities aimed at Web systems. The empirical analysis is based on four datasets, each spanning several months, collected by high-interaction honeypots. The results show that behavioral clustering analysis can be used to distinguish between attack sessions and vulnerability scan sessions. However, the performance depends heavily on the dataset. Furthermore, the results show that attacks differ from vulnerability scans in only a small number of features (i.e., session characteristics). Specifically, for each dataset, the best feature selection method (in terms of high probability of detection and low probability of false alarm) selects only three features and results in three to four clusters, significantly improving the clustering performance compared to the case when all features are used. The best subset of features and the extent of the improvement, however, also depend on the dataset.
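
The evaluation idea described above (cluster the sessions, then score the clusters by probability of detection and probability of false alarm, with and without feature selection) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sessions are synthetic rather than honeypot data, plain k-means with two clusters stands in for whichever clustering algorithms the study actually used, the three "selected" features are chosen by hand rather than by a feature selection method, and the helper names (`make_sessions`, `select`, `kmeans`, `detection_and_false_alarm`) are hypothetical.

```python
import math
import random


def make_sessions(n_per_class=100, seed=1):
    """Synthetic sessions (NOT honeypot data): 3 informative + 3 noise features."""
    rng = random.Random(seed)
    sessions, is_attack = [], []
    # Assumed log-scale means for (requests, bytes transferred, duration).
    for attack, means in ((True, (3.0, 9.0, 4.0)), (False, (1.0, 6.0, 1.0))):
        for _ in range(n_per_class):
            informative = [rng.gauss(m, 0.5) for m in means]
            noise = [rng.gauss(0.0, 3.0) for _ in range(3)]
            sessions.append(tuple(informative + noise))
            is_attack.append(attack)
    return sessions, is_attack


def select(sessions, columns):
    """Keep only the chosen feature columns of every session."""
    return [tuple(s[c] for c in columns) for s in sessions]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return assign


def detection_and_false_alarm(assign, is_attack):
    """Label each cluster by majority vote, then score attack detection."""
    cluster_is_attack = {}
    for c in set(assign):
        labels = [y for a, y in zip(assign, is_attack) if a == c]
        cluster_is_attack[c] = sum(labels) * 2 > len(labels)
    predicted = [cluster_is_attack[a] for a in assign]
    true_pos = sum(p and y for p, y in zip(predicted, is_attack))
    false_pos = sum(p and not y for p, y in zip(predicted, is_attack))
    n_attack = sum(is_attack)
    n_scan = len(is_attack) - n_attack
    # Probability of detection: attacks placed in attack-labeled clusters.
    # Probability of false alarm: scans placed in attack-labeled clusters.
    return true_pos / n_attack, false_pos / n_scan


if __name__ == "__main__":
    sessions, is_attack = make_sessions()
    for name, cols in (("all six features", range(6)),
                       ("three selected features", (0, 1, 2))):
        assign = kmeans(select(sessions, cols), k=2)
        p_d, p_fa = detection_and_false_alarm(assign, is_attack)
        print(f"{name}: P(detection)={p_d:.2f}, P(false alarm)={p_fa:.2f}")
```

In this toy setup the uninformative columns can blur the clusters, while the three informative columns separate the two synthetic groups cleanly. That mirrors the direction of the reported finding, but the numbers it prints say nothing about the real datasets.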