An investigation of byte n-gram features for malware classification

Raff, Edward; Zak, Richard; Cox, Russell J.; Sylvester, Jared; Yacci, Paul; Ward, Robin V.; Tracy, Anna; McLean, Mark; Nicholas, Charles

doi:10.1007/s11416-016-0283-1

Cited by 120 publications

(108 citation statements)

References 29 publications


“…In such a case it is generally beneficial to perform feature selection on the n-grams extracted due to computational constraints and to reduce the impact from the curse of dimensionality [1,2]. This can be done by using information-gain or simply removing features that do not reach a minimum frequency [25,41,45]. Extracting the 328 bytes from our headers of interest significantly reduces the amount of data to process, increasing the flexibility of what we can experiment with when using n-grams.…”

Section: Appendix A N-gram Details (mentioning)

confidence: 99%

Learning the PE Header, Malware Detection with Minimal Domain Knowledge

Raff

Sylvester

Nicholas

2017

Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security

Self Cite

123


Many efforts have been made to use various forms of domain knowledge in malware detection. Currently there exist two common approaches to malware detection without domain knowledge, namely byte n-grams and strings. In this work we explore the feasibility of applying neural networks to malware detection and feature learning. We do this by restricting ourselves to a minimal amount of domain knowledge in order to extract a portion of the Portable Executable (PE) header. By doing this we show that neural networks can learn from raw bytes without explicit feature construction, and perform even better than a domain knowledge approach that parses the PE header into explicit features.
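The citation statement above mentions two ways of pruning byte n-gram features: dropping n-grams that do not reach a minimum frequency, and ranking by information gain. The following is a minimal sketch of both ideas, not the method used in either paper. The function names, the toy byte strings, the `min_df` threshold, and the label encoding (1 for malware, 0 for benign) are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def byte_ngrams(data: bytes, n: int) -> set:
    """Return the set of distinct byte n-grams that occur in one file."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def select_by_min_df(files, n, min_df):
    """Keep n-grams that appear in at least `min_df` files (document frequency)."""
    df = Counter()
    for data in files:
        df.update(byte_ngrams(data, n))
    return {gram for gram, count in df.items() if count >= min_df}


def entropy(pos, total):
    """Binary entropy in bits of a group with `pos` positive labels out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(gram, files, labels):
    """Information gain of the binary feature 'file contains gram'."""
    present = [y for data, y in zip(files, labels) if gram in data]
    absent = [y for data, y in zip(files, labels) if gram not in data]
    total = len(labels)
    h_before = entropy(sum(labels), total)
    h_after = (
        len(present) / total * entropy(sum(present), len(present))
        + len(absent) / total * entropy(sum(absent), len(absent))
    )
    return h_before - h_after


# Hypothetical toy corpus: four short byte strings standing in for file headers.
files = [b"\x4d\x5a\x90\x00", b"\x4d\x5a\x00\x00",
         b"\x7f\x45\x4c\x46", b"\x4d\x5a\x90\x03"]
labels = [1, 1, 0, 1]  # assumed encoding: 1 = malware, 0 = benign

kept = select_by_min_df(files, n=2, min_df=2)
ranked = sorted(kept, key=lambda g: information_gain(g, files, labels), reverse=True)
```

On a real corpus the document-frequency filter would typically run first, since it is cheap and shrinks the candidate set before the more expensive information-gain ranking.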


“…Intuitively, we expect that as the training data becomes more generic, the models will become less accurate, and our results do indeed support this intuition. We believe that the results that we provide in this paper cast the work presented in (Raff et al, 2016) in a much different light, namely, that the inability to construct a strong model based on the extremely diverse and generic data follows immediately from the generality of the data itself, rather than being an inherent weakness of a particular feature, such as n-grams.…”

Section: Introduction (mentioning)

confidence: 84%

“…Previous research has shown that a variety of techniques based on byte n-grams can achieve relatively high accuracies for the detection problem (Liangboonprakong and Sornil, 2013; Reddy and Pujari, 2006; Shabtai et al, 2009; Tabish et al, 2009). However, a recent study based on n-gram analysis rejects this view and argues that n-grams promote a gross level of overfitting (Raff et al, 2016).…”

Section: Introduction (mentioning)

confidence: 99%


On the Effectiveness of Generic Malware Models

Bagga

Troia

Stamp

2018

Proceedings of the 15th International Joint Conference on E-Business and Telecommunications


Malware detection based on machine learning typically involves training and testing models for each malware family under consideration. While such an approach can generally achieve good accuracy, it requires many classification steps, resulting in a slow, inefficient, and potentially impractical process. In contrast, classifying samples as malware or benign based on more generic "families" would be far more efficient. However, extracting common features from extremely general malware families will likely result in a model that is too generic to be useful. In this research, we perform controlled experiments to determine the tradeoff between generality and accuracy, over a variety of machine learning techniques, based on n-gram features.


“…-We show least squares regression in the form of the ELM with a non-linear kernel can provide a template to fully enhance the feature space rather than implicit feature selection of the regressor used in [17]. The Malytics generalization performance for unseen data also shows the effectiveness of the applied regularization technique.…”

Section: Introduction (mentioning)

confidence: 95%

Malytics: A Malware Detection Scheme

et al. 2018


An important problem of cyber-security is malware analysis. Besides good precision and recognition rate, a malware detection scheme needs to be able to generalize well for novel malware families (a.k.a. zero-day attacks). It is important that the system does not require excessive computation, particularly for deployment on mobile devices. In this paper, we propose a novel scheme to detect malware which we call Malytics. It is not dependent on any particular tool or operating system. It extracts static features of any given binary file to distinguish malware from benign. Malytics consists of three stages: feature extraction, similarity measurement and classification. The three stages are implemented by a neural network with two hidden layers and an output layer. We show feature extraction, which is performed by tf-simhashing, is equivalent to the first layer of a particular neural network. We evaluate Malytics performance on both Android and Windows platforms. Malytics outperforms a wide range of learning-based techniques and also individual state-of-the-art models on both platforms. We also show Malytics is resilient and robust in addressing zero-day malware samples. The F1-score of Malytics is 97.21% and 99.45% on Android dex files and Windows PE files respectively, in the applied datasets. The speed and efficiency of Malytics are also evaluated.
