Hyrum S. Anderson scite author profile

EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models

This paper describes EMBER: a labeled benchmark dataset for training machine learning models to statically detect malicious Windows portable executable files. The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign). To accompany the dataset, we also release open-source code for extracting features from additional binaries so that additional sample features can be appended to the dataset. This dataset fills a void in the information security machine learning community: a benign/malicious dataset that is large, open and general enough to cover several interesting use cases. We enumerate several use cases that we considered when structuring the dataset. Additionally, we demonstrate one use case wherein we compare a baseline gradient boosted decision tree model trained using LightGBM with default settings to MalConv, a recently published end-to-end (featureless) deep learning model for malware detection. Results show that even without hyperparameter optimization, the baseline EMBER model outperforms MalConv. The authors hope that the dataset, code and baseline model provided by EMBER will help invigorate machine learning research for malware detection, in much the same way that benchmark datasets have advanced computer vision research.

Sparse imaging for fast electron microscopy

Anderson, Ilic-Helms, Rohrer, et al. 2013

Scanning electron microscopes (SEMs) are used in neuroscience and materials science to image centimeters of sample area at nanometer scales. Since imaging rates are in large part SNR-limited, large collections can lead to weeks of around-the-clock imaging time. To increase data collection speed, we propose and demonstrate on an operational SEM a fast method to sparsely sample and reconstruct smooth images. To accurately localize the electron probe position at fast scan rates, we model the dynamics of the scan coils, and use the model to rapidly and accurately visit a randomly selected subset of pixel locations. Images are reconstructed from the undersampled data by compressed sensing inversion using image smoothness as a prior. We report image fidelity as a function of acquisition speed by comparing traditional raster to sparse imaging modes. Our approach is equally applicable to other domains of nanometer microscopy in which the time to position a probe is a limiting factor (e.g., atomic force microscopy), or in which excessive electron doses might otherwise alter the sample being observed (e.g., scanning transmission electron microscopy).
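The sample-then-reconstruct idea in the abstract above can be illustrated with a deliberately simplified 1D sketch. It is not the paper's compressed sensing inversion: it swaps in a plain smoothness-regularized interpolation (each unmeasured pixel is repeatedly replaced by the average of its neighbors while measured pixels stay fixed), and the function names, the sampling fraction, and the iteration count are all assumptions made for illustration.

```python
import math
import random


def sample_positions(n, fraction, seed=0):
    """Pick a random subset of pixel indices (hypothetical helper).

    Both endpoints are always included so the reconstruction never has to
    extrapolate beyond the measured range.
    """
    count = max(0, int(fraction * n) - 2)
    rng = random.Random(seed)
    interior = rng.sample(range(1, n - 1), count)
    return sorted([0, n - 1] + interior)


def reconstruct_smooth(n, samples, iterations=2000):
    """Fill unmeasured pixels by iterative neighbor averaging.

    `samples` maps pixel index -> measured value. Measured pixels are held
    fixed; every other pixel is repeatedly set to the mean of its neighbors
    (Gauss-Seidel sweeps), which minimizes the sum of squared differences
    between adjacent pixels, a simple stand-in for a smoothness prior.
    """
    mean = sum(samples.values()) / len(samples)
    x = [samples.get(i, mean) for i in range(n)]
    for _ in range(iterations):
        for i in range(n):
            if i in samples:
                continue
            neighbors = [x[j] for j in (i - 1, i + 1) if 0 <= j < n]
            x[i] = sum(neighbors) / len(neighbors)
    return x


# Example: measure 25% of a smooth scan line and reconstruct the rest.
n = 200
truth = [math.sin(2 * math.pi * i / n) for i in range(n)]
measured = {i: truth[i] for i in sample_positions(n, 0.25)}
estimate = reconstruct_smooth(n, measured)
```

In one dimension this prior reduces to linear interpolation between measured pixels, so the sketch only conveys why smooth images tolerate undersampling; it says nothing about the scan-coil dynamics model the abstract describes.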

DeepDGA

Anderson, Woodbridge, Filar. 2016

Detecting Homoglyph Attacks with a Siamese Neural Network

Woodbridge, Anderson, Ahuja, et al. 2018

A homoglyph (name spoofing) attack is a common technique used by adversaries to obfuscate file and domain names. This technique creates process or domain names that are visually similar to legitimate and recognized names. For instance, an attacker may create malware with the name svch0st.exe so that in a visual inspection of running processes or a directory listing, the process or file name might be mistaken for the Windows system process svchost.exe. There has been limited published research on detecting homoglyph attacks. Current approaches rely on string comparison algorithms (such as Levenshtein distance) that result in computationally heavy solutions with a high number of false positives. In addition, there is a shortage of publicly available datasets for reproducible research, with most datasets focused on phishing attacks, in which homoglyphs are not always used. This paper presents a fundamentally different solution to this problem using a Siamese convolutional neural network (CNN). Rather than leveraging similarity based on character swaps and deletions, this technique uses a learned metric on strings rendered as images: a CNN learns features that are optimized to detect visual similarity of the rendered strings. The trained model is used to convert thousands of potentially targeted process or domain names to feature vectors. These feature vectors are indexed using randomized KD-Trees to make similarity searches extremely fast with minimal computational processing. This technique shows a considerable 13% to 45% improvement over baseline techniques in terms of area under the receiver operating characteristic curve (ROC AUC). In addition, we provide both code and data to further future research.
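For context, the string-comparison baseline that the abstract contrasts against can be sketched in a few lines. This is a generic Levenshtein distance with a hypothetical `flag_lookalikes` helper and an assumed distance threshold of 1; it is not the paper's code or its Siamese CNN.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (free if equal)
            ))
        previous = current
    return previous[-1]


def flag_lookalikes(candidate, protected_names, max_distance=1):
    """Return protected names within `max_distance` edits of `candidate`.

    Exact matches (distance 0) are skipped because they are the legitimate
    name itself. The threshold is an illustrative assumption.
    """
    return [
        name for name in protected_names
        if 0 < levenshtein(candidate.lower(), name.lower()) <= max_distance
    ]


suspects = flag_lookalikes("svch0st.exe", ["svchost.exe", "explorer.exe"])
```

Edit distance scores "svch0st.exe" and "svchxst.exe" identically against "svchost.exe" (one substitution each), even though only the first is visually deceptive. That blindness to appearance is one plausible source of the false positives the abstract mentions, and it is what a learned visual metric aims to address.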

Fabrication and Joining of Ceramic Compact Heat Exchangers for Process Integration

Lewinsohn, Wilson, Fellows, et al. 2012. Int J Applied Ceramic Tech

Many energy conversion systems use thermal processes to convert chemical energy to mechanical or electrical energy. There are also many industrial processes performed at high temperature that produce excess heat. In both of these situations, heat exchangers can be used to recover thermal energy and make the processes more energy efficient. The more heat that can be recovered, the more efficient the process will be, so there is a strong demand for heat exchangers that operate at as high a temperature as possible. Ceramic heat exchangers permit operation at higher temperatures than with other materials. Additionally, compact heat exchangers are highly efficient and cost-effective. This paper will describe principles of design, methods of fabrication, and joining methods for ceramic compact heat exchangers for integration of such heat exchangers into practical applications.

Reliable early classification of time series

Anderson, Parrish, Tsukida, et al. 2012

Joint deconvolution and classification with applications to passive acoustic underwater multipath

Anderson, Gupta. 2008

This paper addresses the problem of classifying signals that have been corrupted by noise and unknown linear time-invariant (LTI) filtering such as multipath, given labeled uncorrupted training signals. A maximum a posteriori approach to the deconvolution and classification is considered, which produces estimates of the desired signal, the unknown channel, and the class label. For cases in which only a class label is needed, the classification accuracy can be improved by not committing to an estimate of the channel or signal. A variant of the quadratic discriminant analysis (QDA) classifier is proposed that probabilistically accounts for the unknown LTI filtering, and which avoids deconvolution. The proposed QDA classifier can work either directly on the signal or on features whose transformation by LTI filtering can be analyzed; as an example, a classifier for subband-power features is derived. Results on simulated data and real Bowhead whale vocalizations show that jointly considering deconvolution with classification can dramatically improve classification performance over traditional methods across a range of signal-to-noise ratios.

Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection

Raff, Fleshman, Zak, et al. 2021. AAAI

Recent work in machine learning has tackled inputs of ever-increasing size, with cyber security presenting sequence classification problems of particularly extreme lengths. In the case of Windows executable malware detection, an input executable could be >=100 MB, which would translate to a time series with T=100,000,000 steps. To date, the closest approach to handling such a task is MalConv, a convolutional neural network capable of processing T=2,000,000 steps. Because the memory used by CNNs is O(T), this has prevented many from processing all executables or further extending the MalConv approach. In this work, we develop a new approach to temporal max pooling that makes the required memory invariant to the sequence length T. This makes MalConv 116x more memory efficient, and up to 25.8x faster to train, while removing the input length restrictions of MalConv. We re-invest these gains into improving the MalConv architecture by developing a new Global Channel Gating design, giving us an attention mechanism capable of learning feature interactions across 100 million time steps in an efficient manner, a capability that the original MalConv approach lacks.
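One way to see why global max pooling over time need not hold the whole sequence in memory is the forward-pass sketch below. It is an illustration under stated assumptions, not the paper's implementation: the filters are plain dot products, the function names are hypothetical, and the training-time handling of gradients and the Global Channel Gating design are not covered.

```python
from typing import Iterable, List, Sequence


def window_features(window: Sequence[float], filters: Sequence[Sequence[float]]) -> List[float]:
    """Apply each filter (a 1D convolution kernel) to one window of the input."""
    return [sum(w * x for w, x in zip(f, window)) for f in filters]


def global_max_pool_full(seq: Sequence[float], filters) -> List[float]:
    """Reference version: materialize all T-k+1 feature vectors, then take the max. O(T) memory."""
    k = len(filters[0])
    feats = [window_features(seq[t:t + k], filters) for t in range(len(seq) - k + 1)]
    return [max(column) for column in zip(*feats)]


def global_max_pool_streaming(chunks: Iterable[Sequence[float]], filters) -> List[float]:
    """Streaming version: keep only a running max per filter plus a k-1 element tail.

    Memory depends on the chunk size and kernel width, not on the total
    sequence length T.
    """
    k = len(filters[0])
    best = [float("-inf")] * len(filters)
    tail: List[float] = []  # last k-1 inputs, so windows spanning chunk boundaries are not missed
    for chunk in chunks:
        buf = tail + list(chunk)
        for t in range(len(buf) - k + 1):
            feats = window_features(buf[t:t + k], filters)
            best = [max(b, f) for b, f in zip(best, feats)]
        tail = buf[-(k - 1):] if k > 1 else []
    return best


# Example: a byte-like sequence processed in 64-step chunks.
seq = [(i * 37 + 11) % 256 for i in range(1000)]
filters = [[1, -1, 2], [2, 0, -3]]
chunks = [seq[i:i + 64] for i in range(0, len(seq), 64)]
pooled = global_max_pool_streaming(chunks, filters)
```

Because the maximum is associative, the streaming result matches the full computation for any chunking, which is the property that lets the pooled representation stay fixed in size as T grows.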
