Identifying anomalies in computer networks is a challenging and complex problem. Often, anomalies occur in extremely local areas of the network. Locality is complex in this setting, since we have an underlying graph structure. To identify local anomalies, we introduce a scan statistic for data extracted from the edges of a graph over time.
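To make the idea of scanning edge data over time concrete, the following is a minimal sketch and not the procedure from the work summarized above. It assumes events arrive as `(time, src, dst)` tuples, that every edge has the same known baseline Poisson rate, and that "local" means the set of edges leaving a single node (an out-star). The function names, the window length, and the one-sided Poisson log-likelihood ratio score are all illustrative assumptions.

```python
import math
from collections import defaultdict


def edge_window_counts(events, window):
    """Count events per (edge, window index); events are (time, src, dst) tuples."""
    counts = defaultdict(int)
    for t, src, dst in events:
        counts[((src, dst), int(t // window))] += 1
    return counts


def poisson_excess_llr(count, expected):
    """One-sided Poisson log-likelihood ratio: positive only when count exceeds expected."""
    if count <= expected:
        return 0.0
    return count * math.log(count / expected) - (count - expected)


def star_scan(events, window, baseline_rate):
    """Return (score, node, window_index) for the highest-scoring out-star.

    Assumption: every edge shares the same baseline rate, so the expected
    count per edge per window is baseline_rate * window.
    """
    expected = baseline_rate * window
    scores = defaultdict(float)
    for ((src, _dst), w), count in edge_window_counts(events, window).items():
        # Sum the per-edge scores over all edges leaving the same node.
        scores[(src, w)] += poisson_excess_llr(count, expected)
    if not scores:
        return None
    (node, w), score = max(scores.items(), key=lambda item: item[1])
    return score, node, w


if __name__ == "__main__":
    # Hypothetical events: node "a" is unusually active in the second window.
    events = [
        (0.5, "a", "b"), (1.2, "a", "b"), (1.4, "a", "b"),
        (1.6, "a", "c"), (1.8, "a", "c"), (1.9, "a", "b"),
        (2.5, "c", "d"),
    ]
    print(star_scan(events, window=1.0, baseline_rate=1.0))
```

In practice the baseline rate would be estimated per edge from historical data, and the maximum score would be compared against a threshold calibrated under the null model; both steps are omitted here.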
The lack of data sets derived from operational enterprise networks continues to be a critical deficiency in the cyber security research community. Unfortunately, releasing viable data sets to the larger community is challenging for a number of reasons, primarily the difficulty of balancing security and privacy concerns against the fidelity and utility of the data. This chapter discusses the importance of cyber security research data sets and introduces a large data set derived from the operational network environment at Los Alamos National Laboratory. The hope is that this data set and associated discussion will act both as a catalyst for new research in cyber security and as motivation for other organizations to release similar data sets to the community.
A novel approach to malware classification is introduced based on analysis of
instruction traces that are collected dynamically from the program in question.
The method has been implemented online in a sandbox environment (i.e., a
security mechanism for separating running programs) at Los Alamos National
Laboratory, and is intended for eventual host-based use, provided the issue of
sampling the instructions executed by a given process without disruption to the
user can be satisfactorily addressed. The procedure represents an instruction
trace with a Markov chain structure in which the transition matrix, $\mathbf
{P}$, has rows modeled as Dirichlet vectors. The malware class (malicious or
benign) is modeled using a flexible spline logistic regression model with
variable selection on the elements of $\mathbf {P}$, which are observed with
error. The utility of the method is illustrated on a sample of traces from
malware and nonmalware programs, and the results are compared to other leading
detection schemes (both signature and classification based). This article also
has supplementary materials available online.
Comment: Published at http://dx.doi.org/10.1214/13-AOAS703 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
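
The abstract above represents an instruction trace as a Markov chain whose transition matrix $\mathbf{P}$ has Dirichlet-distributed rows. As a minimal sketch of that representation only (not the paper's estimation or classification procedure), the code below counts instruction-to-instruction transitions and returns the posterior mean of each row under a symmetric Dirichlet prior. The prior weight `alpha`, the function name, and the example instructions are assumptions made for illustration.

```python
from collections import Counter


def transition_posterior_mean(trace, alpha=1.0):
    """Estimate a Markov transition matrix from one instruction trace.

    Each row gets a symmetric Dirichlet(alpha) prior (an assumption), so the
    posterior mean of entry (a, b) is
        (count(a -> b) + alpha) / (count(a -> anything) + alpha * K),
    where K is the number of distinct instructions seen in the trace.
    """
    states = sorted(set(trace))
    k = len(states)
    # Count consecutive pairs (a, b), i.e. observed transitions a -> b.
    pair_counts = Counter(zip(trace, trace[1:]))
    matrix = {}
    for a in states:
        row_total = sum(pair_counts[(a, b)] for b in states)
        matrix[a] = {
            b: (pair_counts[(a, b)] + alpha) / (row_total + alpha * k)
            for b in states
        }
    return matrix


if __name__ == "__main__":
    # Hypothetical trace of instruction mnemonics.
    trace = ["mov", "add", "mov", "add", "jmp", "mov"]
    for src, row in transition_posterior_mean(trace).items():
        print(src, row)
```

The smoothing from `alpha` keeps every estimated transition probability strictly positive, which matters when short traces leave many transitions unobserved. The estimated entries could then serve as features for a downstream classifier, which is the role the elements of $\mathbf{P}$ play in the abstract.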