In this paper, we explore the effectiveness of dynamic analysis techniques for identifying malware, using Hidden Markov Models (HMMs) and Profile Hidden Markov Models (PHMMs), both trained on sequences of API calls. We compare our results with static analysis using HMMs trained on sequences of opcodes, and show that dynamic analysis achieves significantly stronger results in many cases. Furthermore, in comparing our two dynamic analysis techniques, we find that the PHMM-based analysis consistently outperforms the HMM-based analysis.
Introduction

News stories about malware-related cyber attacks abound. In 2014, Twitter was at the receiving end of a major cyber attack. According to news reports, 250,000 users' email addresses, user names, and passwords were compromised. Twitter was able to detect the attack by identifying abnormal patterns in how data was accessed [14].

Target fell victim to a major security breach during the 2013 holiday season, in which credit and debit card details of more than a million customers were compromised. The information was stolen by hacking the credit card swipe systems at Target stores [10]. This one attack drove down Target's quarterly revenue by tens of millions of dollars [11].

In today's world of malware and cyber threats, it has become critical to develop techniques that quickly identify malware. In this paper, we look at different malware detection techniques and compare their results.

We use the concept of software birthmarks for malware detection. Software birthmarks are inherent characteristics that can be used to identify particular software [23,33]. The goal is to obtain a unique identifier for each executable. We can then use these birthmarks to test the similarity between two executables. If the birthmarks of the two files are sufficiently similar, then we assume that one piece of software is closely related to the other. This strategy has been the basis of a variety of techniques for identifying metamorphic malware with statistical approaches [17,21,23,26,33,34,41,44].
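
To make the birthmark idea concrete, the following minimal sketch compares two executables by a toy birthmark: the multiset of n-grams over a recorded API call sequence, scored with a multiset Jaccard similarity. This is an illustration only and is not the HMM or PHMM scoring used in this paper. The function names, the sample call traces, and the threshold value are all assumptions introduced for the example.

```python
from collections import Counter

# Hypothetical cutoff; a real threshold would have to be chosen empirically.
THRESHOLD = 0.5


def birthmark(api_calls, n=2):
    """Toy birthmark: the multiset of n-grams over an API call sequence."""
    return Counter(
        tuple(api_calls[i:i + n]) for i in range(len(api_calls) - n + 1)
    )


def similarity(a, b):
    """Multiset Jaccard similarity between two birthmarks, in [0, 1]."""
    union = sum((a | b).values())          # element-wise maximum of counts
    if union == 0:
        return 0.0                         # no n-grams, so no evidence of similarity
    return sum((a & b).values()) / union   # element-wise minimum over maximum


def is_related(calls_a, calls_b, n=2, threshold=THRESHOLD):
    """Treat two executables as related if their birthmarks are similar enough."""
    return similarity(birthmark(calls_a, n), birthmark(calls_b, n)) >= threshold


# Made-up call traces for illustration; these are not data from the paper.
known = ["CreateFileW", "WriteFile", "CloseHandle",
         "RegOpenKeyExW", "RegSetValueExW"]
variant = ["CreateFileW", "WriteFile", "WriteFile", "CloseHandle",
           "RegOpenKeyExW", "RegSetValueExW"]
benign = ["GetSystemTime", "CreateFileW", "ReadFile", "CloseHandle"]

print(is_related(known, variant))  # True: 4 shared bigrams out of 5 in the union
print(is_related(known, benign))   # False: no shared bigrams
```

A set-based score like this ignores where in the trace an n-gram occurs. Statistical models such as HMMs and PHMMs are meant to capture more of that sequential structure, which is one reason to prefer them over a simple overlap measure.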