Abstract: The main aim of this work is to develop and implement an automatic anomaly-detection algorithm for meteorological time series. To this end, we develop an approach to constructing an ensemble of anomaly detectors combined with adaptive threshold selection based on artificially generated anomalies. We demonstrate the efficiency of the proposed method by integrating its implementation into the "Minimax-94" road weather information system.
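The abstract above pairs a detector ensemble with a threshold tuned on artificially generated anomalies. A minimal sketch of that idea, assuming two toy base detectors (rolling z-score and first difference) and synthetic spike injection — all function names, constants, and the F1-based threshold search here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def zscore_detector(x, window=24):
    """Rolling z-score anomaly score (one simple base detector)."""
    scores = np.zeros_like(x, dtype=float)
    for i in range(window, len(x)):
        seg = x[i - window:i]
        sd = seg.std() or 1.0
        scores[i] = abs(x[i] - seg.mean()) / sd
    return scores

def diff_detector(x):
    """Score by magnitude of the first difference (a second base detector)."""
    d = np.abs(np.diff(x, prepend=x[0]))
    return d / (d.std() or 1.0)

def fit_threshold(x, detectors, n_anoms=20, shift=5.0, seed=0):
    """Inject synthetic spikes, then pick the ensemble threshold that
    maximizes F1 against the known injection labels."""
    rng = np.random.default_rng(seed)
    y = np.zeros(len(x), dtype=bool)
    xs = x.copy()
    idx = rng.choice(len(x), size=n_anoms, replace=False)
    xs[idx] += shift
    y[idx] = True
    score = np.mean([d(xs) for d in detectors], axis=0)  # simple averaging ensemble
    best_t, best_f1 = 0.0, -1.0
    for t in np.quantile(score, np.linspace(0.5, 0.999, 100)):
        pred = score > t
        tp = np.sum(pred & y); fp = np.sum(pred & ~y); fn = np.sum(~pred & y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

x = np.sin(np.linspace(0, 20, 500)) + np.random.default_rng(1).normal(0, 0.2, 500)
t = fit_threshold(x, [zscore_detector, diff_detector])
```

The threshold adapts to the data at hand because it is re-fit on each series with freshly injected anomalies, rather than being a fixed global constant.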
“…Earthquakes of large magnitude are rare events, a kind of anomaly. Thus we can first detect sequences of anomalies of different types in the historical stream of earthquake data [3,26,9,19,32], and then construct ensembles for rare-event prediction [2,29] that use the detected anomalies and their features as precursors of major earthquakes. Such ensembles can optimize specific detection metrics similar to the one used in [7] and exploit privileged information about future events, which is accessible only during the training stage; an analogous approach, used in [8,28] for anomaly detection, yielded a significant accuracy improvement. Moreover, historical earthquake data has a spatial component, so a graph of dependencies between the event streams registered by different ground stations can be constructed, enabling modern methods for graph feature learning [20] and panel time-series feature extraction [24,23]. The ROC AUC score measures the quality of a binary classifier.…”
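The snippet above closes by noting that ROC AUC measures binary-classifier quality. ROC AUC equals the probability that a randomly chosen positive example outranks a randomly chosen negative one, which gives a compact numpy computation (a pairwise sketch; fine for small samples, quadratic in sample size):

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a
    random negative (equivalent to the Mann-Whitney U statistic)."""
    y_true = np.asarray(y_true, dtype=bool)
    pos, neg = scores[y_true], scores[~y_true]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()  # ties count half
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], np.array([0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```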
We construct a classification model that predicts whether an earthquake with magnitude above a threshold will take place at a given location within 30–180 days of a given moment in time. A common approach relies on expert forecasts based on features such as Region-Time-Length (RTL) characteristics. The proposed approach applies machine learning on top of multiple RTL features to take into account effects at various scales and to improve prediction accuracy. On historical data for Japanese earthquakes in 1992–2005, with predictions at the locations given in this database, the best model achieves precision up to ∼0.95 and recall up to ∼0.98.
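The abstract's key ingredient is computing RTL characteristics at several spatial and temporal scales and feeding them to a classifier. A rough numpy sketch of a multi-scale RTL feature vector — the rupture-length scaling constants, the exponential weighting, and the scale grid below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def rtl(events, t_now, xy_now, r0, t0):
    """Region-Time-Length characteristic at location xy_now and time t_now.
    events: (N, 4) array of rows (t_i, x_i, y_i, mag_i) for past earthquakes."""
    t_i, x_i, y_i, m_i = events[events[:, 0] < t_now].T
    r_i = np.hypot(x_i - xy_now[0], y_i - xy_now[1])
    l_i = 10 ** (0.5 * m_i - 1.8)              # rupture length from magnitude (assumed scaling)
    R = np.sum(np.exp(-r_i / r0))              # spatial weight
    T = np.sum(np.exp(-(t_now - t_i) / t0))    # temporal weight
    L = np.sum(l_i / np.maximum(r_i, 1e-3))    # length-to-distance weight
    return R * T * L

def rtl_features(events, t_now, xy_now, scales):
    """One RTL value per (r0, t0) scale -> feature vector for a classifier."""
    return np.array([rtl(events, t_now, xy_now, r0, t0) for r0, t0 in scales])

rng = np.random.default_rng(0)
events = np.column_stack([
    rng.uniform(0, 365, 200),        # time (days)
    rng.uniform(0, 100, (200, 2)),   # x, y (km)
    rng.uniform(3, 6, 200),          # magnitude
])
feats = rtl_features(events, 365.0, (50.0, 50.0), [(10, 30), (25, 60), (50, 120)])
```

Any standard classifier (e.g. gradient boosting) can then be trained on such vectors, one per location and time.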
“…Almost any feature-based machine learning method may be applied to anomaly detection problems; approaches described in the literature include principal component analysis, support vector machines (Tran et al, 2019), HDOutliers (Leigh et al, 2018), k-nearest neighbors (Russo et al, 2020; Talagala et al, 2019), clustering (Hill and Minsker, 2010), random forest (Russo et al, 2020), XGBoost, and isolation forest (Smolyakov et al, 2019). The success of feature-based techniques in detecting anomalies in environmental sensor data is mixed (Hill and Minsker, 2010; Leigh et al, 2018; Russo et al, 2020).…”
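Of the methods listed above, the k-nearest-neighbor detector is simple enough to sketch in a few lines of numpy: a point's anomaly score is its mean distance to its k closest neighbors, so isolated points score high (a toy illustration, not any cited paper's implementation):

```python
import numpy as np

def knn_anomaly_scores(X, k=5):
    """k-nearest-neighbor anomaly score: mean distance to the k closest
    other points; isolated points get high scores."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                 # exclude self-distance
    knn = np.sort(d, axis=1)[:, :k]
    return knn.mean(axis=1)

rng = np.random.default_rng(42)
X = rng.normal(0, 1, (100, 2))
X[0] = [8.0, 8.0]                  # plant one obvious outlier
scores = knn_anomaly_scores(X)     # scores.argmax() recovers the outlier
```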
Sensors measuring environmental phenomena at high frequency commonly report anomalies related to fouling, sensor drift and calibration, and data logging and transmission issues. The suitability of data for analysis and decision making often depends on manual review and adjustment. Machine learning techniques have the potential to automate identification and correction of anomalies, streamlining the quality-control process. We explored approaches for automating anomaly detection and correction of aquatic sensor data, implemented in a Python package (PyHydroQC). We applied both classical and deep learning time-series regression models that estimate values, identify anomalies based on dynamic thresholds, and offer correction estimates. The techniques were developed, and their performance assessed, using data reviewed, corrected, and labeled by technicians in an aquatic monitoring use case. Auto-Regressive Integrated Moving Average (ARIMA) models consistently performed best, and aggregating results from multiple models improved detection. PyHydroQC includes custom functions and a workflow for anomaly detection and correction.
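The core pattern in the abstract — a regression model estimates values, and points whose residuals exceed a dynamic threshold are flagged — can be sketched without reproducing PyHydroQC's actual API. The block below is a numpy-only stand-in: a least-squares AR(p) model substitutes for ARIMA, and the threshold is a rolling n-sigma bound on recent residuals (all names and constants are illustrative assumptions):

```python
import numpy as np

def ar_predict(x, p=3):
    """Least-squares AR(p) one-step-ahead predictions (a simple stand-in
    for an ARIMA regression model)."""
    X = np.column_stack([x[i:len(x) - p + i] for i in range(p)])  # lag features
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y, X @ coef

def dynamic_anomalies(x, p=3, window=50, n_sigma=4.0):
    """Flag points whose residual exceeds a rolling n-sigma threshold."""
    y, pred = ar_predict(x, p)
    resid = np.abs(y - pred)
    flags = np.zeros(len(resid), dtype=bool)
    for i in range(window, len(resid)):
        flags[i] = resid[i] > n_sigma * resid[i - window:i].std()
    return flags

rng = np.random.default_rng(3)
x = np.cumsum(rng.normal(0, 0.1, 600))   # smooth sensor-like signal
x[400] += 5.0                            # inject a spike
flags = dynamic_anomalies(x)             # the spike is flagged
```

Because the threshold is computed from a rolling window of residuals, it adapts as the sensor's noise level drifts, which fixed thresholds cannot do.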
“…In recent years, the popularity and the results achieved by the ensemble approach to the outlier detection problem have grown as well. The current state of ensemble analysis and various ensemble procedures for outlier detection are presented in [12–17]. Although outlier detection and changepoint detection are often considered subproblems of the general anomaly detection problem, the ensemble approach to changepoint detection is weakly formalized and far less studied.…”
Section: Introduction
“…Model-centered ensembles are those in which we vary the models used to create the ensemble, rather than picking subsets of data points or data features (data-centered). A variety of scaling and aggregation functions for outlier, changepoint, and classification ensembles, as well as related issues, can be found in [12–14,16,18–20,27,28]. Though scaling can be included in, and considered part of, the aggregation procedure [4], we treat it separately from the aggregation function.…”
“…The difference between static and dynamic weighting is presented in [29]. Commonly, the weights for the various models or cost functions are predetermined [16,29]. For unsupervised offline ensembles, the weights can express the degree of confidence in each separate detector.…”
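The two ingredients discussed in the snippets above — scaling raw scores to a common range, then aggregating them with predetermined (static) weights — fit in a short sketch. Assuming min-max scaling and a weighted mean (one of many scaling/aggregation choices surveyed in the cited papers):

```python
import numpy as np

def minmax_scale(s):
    """Bring raw detector scores onto [0, 1] so they are comparable."""
    span = s.max() - s.min()
    return (s - s.min()) / span if span else np.zeros_like(s)

def weighted_aggregate(score_sets, weights):
    """Static weighting: combine scaled scores with predetermined weights
    expressing confidence in each separate detector."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                 # normalize to sum to 1
    S = np.vstack([minmax_scale(s) for s in score_sets])
    return w @ S

rng = np.random.default_rng(7)
s1 = rng.uniform(0, 10, 50)      # detector A, arbitrary score scale
s2 = rng.uniform(0, 1, 50)       # detector B, different scale
combined = weighted_aggregate([s1, s2], weights=[0.7, 0.3])
```

Scaling is kept as a separate step here, mirroring the snippet's point that it can be treated independently of the aggregation function.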
Offline changepoint detection (CPD) algorithms are used to segment a signal in an optimal way. Generally, these algorithms assume that the changed statistical properties of the signal are known and that appropriate models (metrics, cost functions) for changepoint detection are used. Otherwise, selecting a proper model can become laborious and time-consuming, with uncertain results. Although the ensemble approach is well known to increase the robustness of individual algorithms and to address these challenges, it is weakly formalized and far less studied for CPD problems than for outlier detection or classification. This paper proposes an unsupervised CPD ensemble (CPDE) procedure, with pseudocode for the proposed ensemble algorithms and a link to their Python implementation. The novelty of the approach lies in aggregating several cost functions before running the changepoint search procedure during offline analysis. Numerical experiments showed that the proposed CPDE outperforms non-ensemble CPD procedures. We also analyzed common CPD algorithms, scaling functions, and aggregation functions, comparing them in the experiments. Results were obtained on two anomaly benchmarks containing industrial faults and failures: the Tennessee Eastman Process (TEP) and the Skoltech Anomaly Benchmark (SKAB). One possible application of this research is estimating failure time for the fault identification and isolation problems of technical diagnostics.
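The abstract's central idea — aggregate several scaled cost functions before running a single changepoint search — can be illustrated for the one-changepoint case. The two costs, the min-max scaling, and the exhaustive split search below are illustrative stand-ins, not the paper's CPDE algorithms:

```python
import numpy as np

def cost_mean(seg):
    """L2 cost: within-segment deviation from the segment mean."""
    return np.sum((seg - seg.mean()) ** 2)

def cost_var(seg):
    """Gaussian variance cost: segment length times log segment variance."""
    return len(seg) * np.log(seg.var() + 1e-8)

def single_changepoint(x, costs, margin=10):
    """Scale and sum several cost functions, then run ONE search:
    pick the split minimizing the aggregated left+right cost."""
    totals = []
    for t in range(margin, len(x) - margin):
        totals.append([c(x[:t]) + c(x[t:]) for c in costs])
    T = np.array(totals)
    T = (T - T.min(axis=0)) / (T.max(axis=0) - T.min(axis=0) + 1e-12)  # min-max per cost
    agg = T.sum(axis=1)                     # aggregation happens before the search
    return margin + int(np.argmin(agg))

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 200)])  # shift at t=200
cp = single_changepoint(x, [cost_mean, cost_var])
```

A non-ensemble procedure would instead search with each cost separately and then have to reconcile the resulting changepoint sets; aggregating costs first yields a single consistent segmentation.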
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.