Scalable data series subsequence matching with ULISSE

Linardi, Michele; Palpanas, Themis

doi:10.1007/s00778-020-00619-4

Cited by 20 publications

(7 citation statements)

References 60 publications

“…This is true for other iSAX-based indices. For example, we could parallelize in a way similar to MESSI the ULISSE index [53], which supports queries of variable length, as well as the DPiSAX index [85], which is a distributed index operating on top of Spark (but currently not supporting parallel execution within each node of the Spark cluster). It is an interesting open problem to study whether there exist efficient parallelization techniques for indexing schemes whose tree index does not satisfy this large fanout property that would result in better performance than MESSI.…”

Section: Discussion (mentioning)

confidence: 99%

Fast Data Series Indexing for In-Memory Data

Peng, Fatourou, Palpanas

2021

Preprint

Self Cite

Data series similarity search is a core operation for several data series analysis applications across many different domains. However, the state-of-the-art techniques fail to deliver the time performance required for interactive exploration, or analysis of large data series collections. In this work, we propose MESSI, the first data series index designed for in-memory operation on modern hardware. Our index takes advantage of the modern hardware parallelization opportunities (i.e., SIMD instructions, multi-socket and multi-core architectures), in order to accelerate both index construction and similarity search processing times. Moreover, it benefits from a careful design in the setup and coordination of the parallel workers and data structures, so that it maximizes its performance for in-memory operations. MESSI supports similarity search using both the Euclidean and Dynamic Time Warping (DTW) distances. Our experiments with synthetic and real datasets demonstrate that overall MESSI is up to 4x faster at index construction, and up to 11x faster at query answering than the state-of-the-art parallel approach. MESSI is the first to answer exact similarity search queries on 100GB datasets in ∼50msec (30-75msec across diverse datasets), which enables real-time, interactive data exploration on very large data series collections.

“…Even though much effort has been dedicated to developing techniques for data series analytics, existing solutions for subsequence matching, motif and discord discovery are limited to fixed length queries/results. In this Ph.D. work, we propose the first scalable solutions to the variable-length version of these problems: ULISSE is the first index that supports variable-length subsequence matching over both Z-normalized and non Z-normalized sequences [15,13,14], while MAD is the first framework that implements variable-length motif and discord discovery [17,4,16].…”

Section: Discussion (mentioning)

confidence: 99%

“…1. ULISSE (ULtra compact Index for variable-length Similarity SEarch in data series) is the first indexing technique that supports variable-length subsequence matching for non Z-normalized and Z-normalized data series [15,13,14].…”

Section: Introduction (mentioning)

confidence: 99%

Effective and Efficient Variable-Length Data Series Analytics

Linardi

2020

Preprint

Self Cite

In the last twenty years, data series similarity search has emerged as a fundamental operation at the core of several analysis tasks and applications related to data series collections. Many solutions to different mining problems work by means of similarity search. In this regard, all the proposed solutions require the prior knowledge of the series length on which similarity search is performed. In several cases, the choice of the length is critical and sensibly influences the quality of the expected outcome. Unfortunately, the obvious brute-force solution, which provides an outcome for all lengths within a given range is computationally untenable. In this Ph.D. work, we present the first solutions that inherently support scalable and variable-length similarity search in data series, applied to sequence/subsequences matching, motif and discord discovery problems. The experimental results show that our approaches are up to orders of magnitude faster than the alternatives. They also demonstrate that we can remove the unrealistic constraint of performing analytics using a predefined length, leading to more intuitive and actionable results, which would have otherwise been missed.

“…Optimized and Approximate Similarity Search. The database community has optimized similarity search methods by using index structures [22,27,28,33,49,72,73,82,83,132,134,139,146] or fast sequential scans [112]. Recently, Echihabi et al. [47,48] compared the efficiency of these methods under a unified experimental framework, showing that there is no single best method that outperforms all the rest.…”

Section: Contributions (mentioning)

confidence: 99%

“…In this context, progressive answers help to speed-up exact queries by stopping execution early, when it is highly probable that the current progressive answer is the exact one. Note that several data series similarity search methods support approximate query answering that can produce increasingly more accurate answers as time goes by [23,51,73,83,134,146], though, none of them provides quality guarantees on the answers. In this work, we focus on the iSAX2+ [23] and DSTree [134] methods, which exhibit superior performance at the similarity search task [47,48].…”

Section: Contributions (mentioning)

confidence: 99%

ProS: data series progressive k-NN similarity search and classification with probabilistic quality guarantees

et al. 2022

Self Cite
