MESSI: In-Memory Data Series Indexing

Peng, Botao; Fatourou, Panagiota; Palpanas, Themis

doi:10.1109/icde48307.2020.00036

Cited by 26 publications

(10 citation statements)

References 32 publications


“…Data series have gathered the attention of the data management community for more than two decades (Agrawal et al, 1993; Jagadish et al, 1995; Rafiei and Mendelzon, 1998; Chakrabarti et al, 2002; Papadimitriou and Yu, 2006; Camerra et al, 2010; Kashyap and Karras, 2011; Wang et al, 2013b; Camerra et al, 2014; Dallachiesa et al, 2014; Zoumpatianos et al, 2016; Yagoubi et al, 2017; Jensen et al, 2017; Palpanas, 2017; Kondylakis et al, 2018; Peng et al, 2018; Gogolou et al, 2019; Echihabi et al, 2018, 2019; Yagoubi et al, 2020; Kondylakis et al, 2019; Peng et al, 2020a; Peng et al, 2020b; Palpanas, 2020; Gogolou et al, 2020). They are now one of the most common types of data, present in virtually every scientific and social domain (Palpanas, 2015; Raza et al, 2015; Mirylenka et al, 2016; Keogh, 2011; Palpanas and Beckmann, 2019; Bagnall et al, 2019).…”

Section: Introduction

mentioning

confidence: 99%

Matrix profile goes MAD: variable-length motif and discord discovery in data series

Linardi

Zhu

Palpanas

et al. 2020

Data Min Knowl Disc

Self Cite


In the last fifteen years, data series motif and discord discovery have emerged as two useful and well-used primitives for data series mining, with applications to many domains, including robotics, entomology, seismology, medicine, and climatology. Nevertheless, the state-of-the-art motif and discord discovery tools still require the user to provide the relative length. Yet, in several cases, the choice of length is critical and unforgiving. Unfortunately, the obvious brute-force solution, which tests all lengths within a given range, is computationally untenable. In this work, we introduce a new framework, which provides an exact and scalable motif and discord discovery algorithm that efficiently finds all motifs and discords in a given range of lengths. We evaluate our approach with five diverse real datasets, and demonstrate that it is up to 20 times faster than the state-of-the-art. Our results also show that removing the unrealistic assumption that the user knows the correct length, can often produce more intuitive and actionable results, which could have otherwise been missed.


“…Similarity Search. A large number of data series similarity search methods has been studied, supporting exact search [7,137,124,81,127,110], approximate search [136,85,10,46,49], or both [32,134,152,33,163,157,88,96,95,122,158,90,123]. In parallel, the research community has also developed exact [23,67,22,26,44,154,57] and approximate [73] similarity search techniques geared towards generic multidimensional vector data.…”

Section: Introduction

mentioning

confidence: 99%

Return of the Lernaean Hydra: Experimental Evaluation of Data Series Approximate Similarity Search

Echihabi

Zoumpatianos

Palpanas

et al. 2020

Preprint

Self Cite


Data series are a special type of multidimensional data present in numerous domains, where similarity search is a key operation that has been extensively studied in the data series literature. In parallel, the multidimensional community has studied approximate similarity search techniques. We propose a taxonomy of similarity search techniques that reconciles the terminology used in these two domains, we describe modifications to data series indexing techniques enabling them to answer approximate similarity queries with quality guarantees, and we conduct a thorough experimental evaluation to compare approximate similarity search techniques under a unified framework, on synthetic and real datasets in memory and on disk. Although data series differ from generic multidimensional vectors (series usually exhibit correlation between neighboring values), our results show that data series techniques answer approximate queries with strong guarantees and an excellent empirical performance, on data series and vectors alike. These techniques outperform the state-of-the-art approximate techniques for vectors when operating on disk, and remain competitive in memory.


“…Thus, traditional solutions and systems are inefficient at, or incapable of managing and processing the voluminous sequence collections that already exist in several domains. Finally, we note that, given the evolution of CPU performance, where the processor clock speed is not increasing due to the power wall constraint, efforts for algorithmic speedups now exploit the parallelism opportunities offered by modern hardware [5,10,35,39,47].…”

mentioning

confidence: 99%

ParIS+: Data Series Indexing on Multi-core Architectures

Peng

Fatourou

Palpanas

2020

IEEE Trans. Knowl. Data Eng.

Self Cite


Data series similarity search is a core operation for several data series analysis applications across many different domains. Nevertheless, even state-of-the-art techniques cannot provide the time performance required for large data series collections. We propose ParIS and ParIS+, the first disk-based data series indices carefully designed to inherently take advantage of multi-core architectures, in order to accelerate similarity search processing times. Our experiments demonstrate that ParIS+ completely removes the CPU latency during index construction for disk-resident data, and for exact query answering is up to 1 order of magnitude faster than the current state-of-the-art index scan method, and up to 3 orders of magnitude faster than the optimized serial scan method. ParIS+ (which is an evolution of the ADS+ index) owes its efficiency to the effective use of multi-core and multi-socket architectures, in order to distribute and execute in parallel both index construction and query answering, and to the exploitation of the Single Instruction Multiple Data (SIMD) capabilities of modern CPUs, in order to further parallelize the execution of instructions inside each core.
