The research literature on cybersecurity incident detection and response is rich in automatic detection methodologies, in particular those based on the anomaly detection paradigm. However, very little attention has been devoted to the diagnosis capabilities of these methods, that is, their ability to provide useful information on the causes of a detected anomaly. This information is of utmost importance for the security team to reduce the time from detection to response. In this paper, we present Multivariate Big Data Analysis (MBDA), a complete intrusion detection approach based on five steps that effectively handles massive amounts of disparate data sources. The approach is designed to deal with the main characteristics of Big Data: high volume, velocity and variety. The core of the approach is the Multivariate Statistical Network Monitoring (MSNM) technique proposed in a recent paper. Unlike state-of-the-art machine learning methodologies applied to the intrusion detection problem, when an anomaly is identified, MBDA outputs the raw log records associated with it, so that the security team can use this information to elucidate its root causes. MBDA builds on two open-source software packages available on GitHub: the MEDA Toolbox and the FCParser. We illustrate the approach with two case studies. The first demonstrates the application of MBDA to semi-structured sources of information, using the data from the VAST 2012 mini challenge 2; this complete case study is supplied in a virtual machine available for download. In the second case study, we show the Big Data capabilities of the approach on data collected from a real network with labeled attacks.
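To make the detection step concrete, the following is a minimal sketch of PCA-based anomaly detection in the spirit of MSNM, assuming feature counts have already been extracted from the parsed logs for each time interval. Function names and thresholds are illustrative; this is not the MEDA Toolbox or FCParser API.

```python
# Minimal sketch of PCA-based monitoring in the spirit of MSNM.
# Rows of X are time intervals; columns are feature counts from the parsed logs.
import numpy as np
from sklearn.decomposition import PCA

def fit_monitoring_model(X_train, n_components=3):
    """Fit a PCA model on calibration (attack-free) data."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) + 1e-12
    Xs = (X_train - mu) / sigma                      # auto-scaling
    pca = PCA(n_components=n_components).fit(Xs)
    return pca, mu, sigma

def monitoring_statistics(pca, mu, sigma, X):
    """Return the D-statistic (Hotelling's T2) and Q-statistic (SPE) per observation."""
    Xs = (X - mu) / sigma
    T = pca.transform(Xs)                            # scores
    D = np.sum((T ** 2) / pca.explained_variance_, axis=1)
    E = Xs - pca.inverse_transform(T)                # residuals
    Q = np.sum(E ** 2, axis=1)
    return D, Q

# Intervals whose D or Q statistic exceeds a calibration-based control limit are
# flagged, and the corresponding raw log lines are handed to the security team.
```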
Multivariate Statistical Process Control (MSPC) based on Principal Component Analysis (PCA) is a well-known methodology in chemometrics aimed at testing whether an industrial process is under Normal Operation Conditions (NOC). As part of the methodology, once an anomalous behaviour is detected, its root causes need to be diagnosed to troubleshoot the problem and/or avoid it in the future. While there have been a number of developments in diagnosis in the past decades, no sound method for comparing existing approaches has been proposed. In this paper, we propose such a procedure and use it to compare several diagnosis methods on randomly simulated data and on data from realistic sources. This is a general comparative approach that takes into account factors that have not previously been considered in the literature. The results show that univariate diagnosis is more reliable than its multivariate counterpart.
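As an illustration of the two families of diagnosis methods being compared, the sketch below contrasts a univariate diagnosis (standardized deviation of each variable from NOC) with a multivariate counterpart (per-variable contributions to the squared prediction error of a PCA model). The PCA model and the ranking functions are assumptions for illustration, not the paper's exact procedures.

```python
# Hedged sketch of two diagnosis strategies for a detected anomaly x_new:
# (i) univariate: rank variables by standardized deviation from NOC statistics,
# (ii) multivariate: rank variables by contribution to the SPE (Q) residual.
import numpy as np

def univariate_diagnosis(x_new, mu, sigma):
    """Rank variables by absolute standardized deviation from NOC."""
    z = np.abs((x_new - mu) / sigma)
    return np.argsort(z)[::-1]            # most deviating variables first

def spe_contribution_diagnosis(x_new, mu, sigma, pca):
    """Rank variables by their contribution to the squared prediction error."""
    xs = (x_new - mu) / sigma
    recon = pca.inverse_transform(pca.transform(xs.reshape(1, -1))).ravel()
    e = xs - recon
    return np.argsort(e ** 2)[::-1]
```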
Network Security Monitoring (NSM) is a popular term referring to the detection of security incidents by monitoring network events. An NSM system is central to the security of current networks, given the escalating sophistication of cyberwarfare. In this paper, we review the state of the art in NSM and derive a new taxonomy of the functionalities and modules in an NSM system. This taxonomy is useful for both researchers and practitioners to assess current NSM deployments and tools. We organize a list of popular tools according to this new taxonomy, and identify challenges in the application of NSM to modern network deployments, such as Software-Defined Networking (SDN) and the Internet of Things (IoT).
Abstract: The evaluation of algorithms and techniques for implementing intrusion detection systems depends to a large extent on the existence of well-designed datasets. In recent years, a great effort has been made to build such datasets. In this work we present a new dataset built from real traffic on which up-to-date attacks are performed. The main advantage of this dataset over previous ones is its usefulness for evaluating IDSs that consider long-term evolution and traffic periodicity. It also allows training and evaluating models that account for differences between day/night or between weekdays/weekends.

Keywords: network security, dataset, IDS, network traffic, netflow

I. INTRODUCTION

Intrusion Detection Systems (IDS) appeared in the security arena as a solution to the problem of identifying malicious activities in networks and systems. In a nutshell, an IDS consists of a module in charge of data acquisition, a pre-processing module that adapts those data for the following steps in the system, and a decision module able to determine whether an event should be considered malicious or not. There are several types of IDS [1]: network-based IDSs (NIDS) monitor network events such as flows or firewall logs, among others, whereas host-based IDSs (HIDS) consider system-related events, for example syslog, file-system monitoring, CPU load, etc. IDSs are also classified according to the detection process. Signature-based IDSs (S-IDS) use rules to decide whether an observed behaviour is malicious or not, whereas anomaly-based IDSs (A-IDS) [2] build a model from training data and consider any behaviour that deviates from this model as anomalous (a minimal sketch of such a decision module is given after this introduction). Note that, although there is a semantic difference between anomalous and malicious behaviour, an A-IDS treats them as equivalent.

An essential problem when evaluating the capabilities of IDSs is the need for a representative dataset that allows comparison between different proposals. In the 1990s, DARPA carried out a project to build a dataset for this purpose, producing the MIT DARPA'98 and DARPA'99 datasets [3]. After being widely used and studied by several authors, some limitations were identified, such as the existence of duplicate records, unbalanced samples between attacks and normal connections, and other limitations inherent to the use of synthetic traffic. Since then, many other researchers and projects have tried to provide improved versions of these datasets, such as NSL-KDD, or to build new ones. More recently, other datasets have been proposed; for example, UNB ISCX 2012, created in 2012 by Shiravi et al. [4]. The most relevant contribution of that work is the use of pe...
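The following is an illustrative sketch, not taken from the paper, of the A-IDS decision module referenced above: a model is fitted on training traffic and any flow that deviates from it is flagged as anomalous. The detector choice, feature columns and placeholder arrays are assumptions.

```python
# Illustrative A-IDS decision module: fit on training flows, flag deviations.
import numpy as np
from sklearn.ensemble import IsolationForest

# Rows = flows; columns could be, e.g., duration, bytes, packets, destination port.
X_train = np.random.rand(1000, 4)   # placeholder for training flows
X_new = np.random.rand(10, 4)       # placeholder for flows to score

detector = IsolationForest(contamination=0.01, random_state=0).fit(X_train)
labels = detector.predict(X_new)    # -1 = anomalous, +1 = normal
```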
Autonomous or self-driving networks are expected to provide a solution to the myriad of extremely demanding new applications in the Future Internet. The key to handling complexity is to perform tasks like network optimization and failure recovery with minimal human supervision. For this purpose, the community relies on the development of new Machine Learning (ML) models and techniques. However, ML can only be as good as the data it is fitted with. Datasets provided to the community as benchmarks for research purposes, which have a relevant impact on research findings and directions, are often assumed to be of good quality by default. In this paper, we show that relatively minor modifications of the same benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) have significantly more impact on model performance than the specific ML technique considered. To understand this finding, we contribute a methodology to investigate the root causes of those differences and to assess the quality of the data labelling. Our findings illustrate the need to devote more attention to (automatic) data quality assessment and optimization techniques in the context of autonomous networks.
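A hedged sketch of the kind of experiment described above: the same classifier is trained on two variants of a benchmark (for example, differently cleaned or labelled versions of UGR'16-like flows) and the resulting performance gap is compared. The data-loading step, variant names and metric are assumptions, not the paper's pipeline.

```python
# Compare the impact of dataset variants versus the ML technique itself.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def evaluate_variant(X, y):
    """Fit one fixed classifier on a dataset variant and return its F1 score."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te))

# score_a = evaluate_variant(X_variant_a, y_variant_a)
# score_b = evaluate_variant(X_variant_b, y_variant_b)
# A large gap between score_a and score_b with the same ML technique points at
# data preparation/labelling, rather than the model, as the dominant factor.
```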
Since the pioneering works by Nomikos and MacGregor, the Batch Multivariate Statistical Process Control (BMSPC) methodology has been extensively revised, and a large number of alternative monitoring approaches have been suggested. The different approaches vary in the batch data alignment, the pre-processing approach, the data arrangement, and/or the type of model used, from two-way to three-way and from linear to nonlinear. One of the most accepted pre-processing schemes, referred to as trajectory centering and scaling (TCS), is based on normalization to zero mean and unit variance around the average trajectory. However, the main drawback of TCS is the inherent increase in the level of uncertainty in the estimation of model parameters. In this work, we illustrate how to improve parameter estimation while maintaining the good properties of this pre-processing approach. This enhancement is achieved with a new pre-processing approach we call PARAMO, which uses more observations than TCS to estimate the pre-processing parameters. We show that this improvement favorably impacts the performance of the monitoring system. The results of this research affect a large proportion of the monitoring approaches proposed to date, and we advocate that the pre-processing procedure proposed here be generally applied in BMSPC.
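For reference, this is a minimal sketch of plain trajectory centering and scaling on batch data stored as a 3-way array of shape (batches I, variables J, time points K). PARAMO itself is not reproduced here; the point made above is that it estimates these same pre-processing parameters from more observations than the I batches used by plain TCS, reducing their uncertainty. The array layout is an assumption for illustration.

```python
# Trajectory centering and scaling (TCS) for batch process data.
import numpy as np

def tcs(X):
    """Center and scale each (variable, time point) pair across batches.

    X has shape (I batches, J variables, K time points).
    """
    mean_traj = X.mean(axis=0)                    # average trajectory, shape (J, K)
    std_traj = X.std(axis=0, ddof=1) + 1e-12      # per variable/time standard deviation
    return (X - mean_traj) / std_traj
```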