Modern computer network defense systems rely primarily on signature-based intrusion detection tools, which generate alerts when patterns that are pre-determined to be malicious are encountered in network data streams. Signatures are created reactively, and only after in-depth manual analysis of a network intrusion. There is little ability for signature-based detectors to identify intrusions that are new or even variants of an existing attack, and little ability to adapt the detectors to the patterns unique to a network environment. Due to these limitations, the need exists for network intrusion detection techniques that can more comprehensively address both known and unknown network-based attacks and can be optimized for the target environment.This work describes a system that leverages machine learning to provide a network intrusion detection capability that analyzes behaviors in channels of communication between individual computers. Using examples of malicious and non-malicious traffic in the target environment, the system can be trained to discriminate between traffic types.The machine learning provides insight that would be difficult for a human to explicitly code as a signature because it evaluates many interdependent metrics simultaneously. With this approach, zero day detection is possible by focusing on similarity to known traffic types rather than mining for specific bit patterns or conditions. This also reduces the burden on organizations to account for all possible attack variant combinations through signatures. The approach is presented along with results from a third-party evaluation of its performance.
In a world where large-scale text collections are not only becoming ubiquitous but also are growing at increasing rates, near duplicate documents are becoming a growing concern that has the potential to hinder many different information filtering tasks. While others have tried to address this problem, prior techniques have only been used on limited collection sizes and static cases. We will briefly describe the problem in the context of Open Source analysis along with our additional constraints for performance. In this work we propose two variations on Multi-dimensional Spectral Hash (MDSH) tailored for working on extremely large, growing sets of text documents. We analyze the memory and runtime characteristics of our techniques and provide an informal analysis of the quality of the near-duplicate clusters produced by our techniques.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.