Automated Anomaly Detection in Large Sequences

Boniol, Paul; Linardi, Michele; Roncallo, Federico; Palpanas, Themis

doi:10.1109/icde48307.2020.00182

Cited by 46 publications

(15 citation statements)

References 17 publications

“…Thus, for each method and condition, our results are based on a total of $N \times n_t = 20\text{K}$ measurements. For all progressive methods, we test the accuracy of their estimates after the similarity search algorithm has visited 1 ($2^0$), 4 ($2^2$), 16 ($2^4$), 64 ($2^6$), 256 ($2^8$), and 1024 ($2^{10}$) leaves. Figure 7 shows the distributions of visited leaves for 100 random queries for all four datasets.…”

Section: Experimental Evaluation

mentioning

confidence: 99%

Data Series Progressive Similarity Search with Probabilistic Quality Guarantees

Gogolou, Tsandilas, Echihabi, et al. 2020

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Self Cite

Existing systems dealing with the increasing volume of data series cannot guarantee interactive response times, even for fundamental tasks such as similarity search. Therefore, it is necessary to develop analytic approaches that support exploration and decision making by providing progressive results, before the final and exact ones have been computed. Prior works lack both efficiency and accuracy when applied to large-scale data series collections. We present and experimentally evaluate a new probabilistic learning-based method that provides quality guarantees for progressive Nearest Neighbor (NN) query answering. We provide both initial and progressive estimates of the final answer that improve during the similarity search, as well as suitable stopping criteria for the progressive queries. Experiments with synthetic and diverse real datasets demonstrate that our prediction methods constitute the first practical solution to the problem, significantly outperforming competing approaches.
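The idea of reporting progressive answers at fixed numbers of visited leaves (1, 4, 16, ..., 1024, as in the citation statement above) can be illustrated with a minimal sketch. This is not the cited paper's method: the function name `progressive_nn`, the representation of "leaves" as plain lists of series, and the fixed visiting order are all assumptions made for illustration, and no probabilistic quality guarantee is computed.

```python
import math
import random


def progressive_nn(query, leaves, checkpoints=(1, 4, 16, 64, 256, 1024)):
    """Scan leaves in the given order and record the best-so-far
    nearest-neighbor distance each time a checkpoint count is reached.

    `leaves` is a hypothetical stand-in for index leaves: a list of
    lists of equal-length series. A real index would choose which leaf
    to visit next; here they are visited in list order.
    """
    best = math.inf
    snapshots = {}
    targets = set(checkpoints)
    for visited, leaf in enumerate(leaves, start=1):
        for series in leaf:
            # Euclidean distance between the query and a candidate series.
            best = min(best, math.dist(query, series))
        if visited in targets:
            snapshots[visited] = best
    return best, snapshots


# Toy data (assumed): 64 leaves, each holding 8 random series of length 16.
random.seed(0)
leaves = [[[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
          for _ in range(64)]
query = [random.gauss(0, 1) for _ in range(16)]

exact, snapshots = progressive_nn(query, leaves)
for visited, estimate in snapshots.items():
    print(f"after {visited:>4} leaves: best-so-far distance = {estimate:.4f}")
```

With only 64 leaves, snapshots are recorded at 1, 4, 16, and 64 visited leaves; the best-so-far distance can only decrease or stay the same, and the last snapshot equals the exact answer because every leaf has been visited by then.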

“…Data series analysis involves pattern matching [54,91], anomaly detection [10,11,17,24], frequent pattern mining [56,72], clustering [48,73,74,86], and classification [19]. These tasks rely on data series similarity.…”

Section: Introduction

mentioning

confidence: 99%

Data Series Progressive Similarity Search with Probabilistic Quality Guarantees

Gogolou, Tsandilas, Echihabi, et al. 2020

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Self Cite

“…An increasing number of applications across many diverse domains continuously produce very large amounts of data series (such as in finance, environmental sciences, astrophysics, neuroscience, engineering, and others [1]-[3]), which makes them one of the most common types of data. When these sequence collections are generated (oftentimes composed of a large number of short series [3], [4]), users need to query and analyze them (e.g., detect anomalies [5], [6]). This process is heavily dependent on data series similarity search (which, apart from being a useful query in itself, also lies at the core of several machine learning methods, such as clustering, classification, motif and outlier detection, etc.)…”

Section: Introduction

mentioning

confidence: 99%

Data Series Indexing Gone Parallel

Peng 2020

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Data series similarity search is a core operation for several data series analysis applications across many different domains. However, the state-of-the-art techniques fail to deliver the time performance required for interactive exploration or analysis of large data series collections. In this Ph.D. work, we present the first data series indexing solutions, for both on-disk and in-memory data, that are designed to inherently take advantage of multi-core architectures, in order to accelerate similarity search processing times. Our experiments on a variety of synthetic and real data demonstrate that our approaches are up to orders of magnitude faster than the alternatives. More specifically, our on-disk solution can answer exact similarity search queries on 100GB datasets in a few seconds, and our in-memory solution in a few milliseconds, which enables real-time, interactive data exploration on very large data series collections.

“…This process is heavily dependent on data series similarity search (which apart from being a useful query in itself, also lies at the core of several machine learning methods, such as, clustering, classification, motif and outlier detection, etc.) [8,9,15,44]. The brute-force approach for evaluating similarity search queries is by performing a sequential pass over the complete dataset.…”

mentioning

confidence: 99%

ParIS+: Data Series Indexing on Multi-core Architectures

Peng, Fatourou, Palpanas 2020

IEEE Trans. Knowl. Data Eng.

Self Cite

Data series similarity search is a core operation for several data series analysis applications across many different domains. Nevertheless, even state-of-the-art techniques cannot provide the time performance required for large data series collections. We propose ParIS and ParIS+, the first disk-based data series indices carefully designed to inherently take advantage of multi-core architectures, in order to accelerate similarity search processing times. Our experiments demonstrate that ParIS+ completely removes the CPU latency during index construction for disk-resident data, and for exact query answering is up to 1 order of magnitude faster than the current state-of-the-art index scan method, and up to 3 orders of magnitude faster than the optimized serial scan method. ParIS+ (which is an evolution of the ADS+ index) owes its efficiency to the effective use of multi-core and multi-socket architectures, in order to distribute and execute in parallel both index construction and query answering, and to the exploitation of the Single Instruction Multiple Data (SIMD) capabilities of modern CPUs, in order to further parallelize the execution of instructions inside each core.
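For context, the brute-force baseline mentioned in the citation statement above (a sequential pass over the complete dataset) can be sketched as follows. This is a minimal illustration assuming Euclidean distance and an in-memory list of equal-length series; the function name `brute_force_nn` and the toy data are hypothetical, and the sketch does not reflect the ParIS+ implementation.

```python
import math


def brute_force_nn(query, dataset):
    """Return (index, distance) of the series closest to `query`,
    found by one sequential pass over every series in `dataset`."""
    best_index, best_distance = -1, math.inf
    for i, series in enumerate(dataset):
        # Euclidean distance; other similarity measures could be swapped in.
        d = math.dist(query, series)
        if d < best_distance:
            best_index, best_distance = i, d
    return best_index, best_distance


# Toy dataset (assumed): three series of length 3.
dataset = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0],
]
index, distance = brute_force_nn([0.9, 1.1, 1.0], dataset)
print(index, round(distance, 4))
```

The cost of this scan grows linearly with the number of series, which is the motivation the citing papers give for index-based and parallel approaches.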
