Forecasting and modeling building energy profiles require tools able to discover patterns within large amounts of collected information. Clustering is the main technique used to partition data into groups based on internal and a priori unknown schemes inherent of the data. The adjustment and parameterization of the whole clustering task is complex and submitted to several uncertainties, being the similarity metric one of the first decisions to be made in order to establish how the distance between two independent vectors must be measured. The present paper checks the effect of similarity measures in the application of clustering for discovering representatives in cases where correlation is supposed to be an important factor to consider, e.g., time series. This is a necessary step for the optimized design and development of efficient clustering-based models, predictors and controllers of time-dependent processes, e.g., building energy consumption patterns. In addition, clustered-vector balance is proposed as a validation technique to compare clustering performances.
Anomaly detection in communication networks provides the basis for the uncovering of novel attacks, misconfigurations and network failures. Resource constraints for data storage, transmission and processing make it beneficial to restrict input data to features that are (a) highly relevant for the detection task and (b) easily derivable from network observations without expensive operations. Removing strong correlated, redundant and irrelevant features also improves the detection quality for many algorithms that are based on learning techniques. In this paper we address the feature selection problem for network traffic based anomaly detection. We propose a multi-stage feature selection method using filters and stepwise regression wrappers. Our analysis is based on 41 widely-adopted traffic features that are presented in several commonly used traffic data sets. With our combined feature selection method we could reduce the original feature vectors from 41 to only 16 features. We tested our results with five fundamentally different classifiers, observing no significant reduction of the detection performance. In order to quantify the practical benefits of our results, we analyzed the costs for generating individual features from standard IP Flow Information Export records, available at many routers. We show that we can eliminate 13 very costly features and thus reducing the computational effort for on-line feature generation from live traffic observations at network nodes.
The consolidation of encryption and big data in network communications have made deep packet inspection no longer feasible in large networks. Early attack detection requires feature vectors which are easy to extract, process, and analyze, allowing their generation also from encrypted traffic. So far, experts have selected features based on their intuition, previous research, or acritically assuming standards, but there is no general agreement about the features to use for attack detection in a broad scope. We compared five lightweight feature sets that have been proposed in the scientific literature for the last few years, and evaluated them with supervised machine learning. For our experiments, we use the UNSW-NB15 dataset, recently published as a new benchmark for network security. Results showed three remarkable findings: (1) Analysis based on source behavior instead of classic flow profiles is more effective for attack detection; (2) meta-studies on past research can be used to establish satisfactory benchmarks; and (3) features based on packet length are clearly determinant for capturing malicious activity. Our research showed that vectors currently used for attack detection are oversized, their accuracy and speed can be improved, and are to be adapted for dealing with encrypted traffic.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.