Coconut: sortable summarizations for scalable indexes over static and streaming data series

Kondylakis, Haridimos; Dayan, Niv; Zoumpatianos, Kostas; Palpanas, Themis

doi:10.1007/s00778-019-00573-w

Cited by 21 publications

(10 citation statements)

References 67 publications

(134 reference statements)


“…Various dimensionality reduction techniques exist for data series, which can then be scanned and filtered [38,49] or indexed and pruned [20,21,22,42,43,44,52,61,65,75,76,81,89] during query answering, including deep-learned methods [80]; for a complete discussion of such techniques, we refer the reader to two recent tutorials on the subject [25,26]. We follow the same approach of indexing the series based on their summaries, though our work is the first to exploit the parallelization opportunities offered by modern hardware, in order to accelerate in-memory index construction and similarity search for data series.…”

Section: Related Work

mentioning

confidence: 99%

Fast Data Series Indexing for In-Memory Data

Peng, Fatourou, Palpanas

2021

Preprint

Self Cite


Data series similarity search is a core operation for several data series analysis applications across many different domains. However, the state-of-the-art techniques fail to deliver the time performance required for interactive exploration, or analysis of large data series collections. In this work, we propose MESSI, the first data series index designed for in-memory operation on modern hardware. Our index takes advantage of the modern hardware parallelization opportunities (i.e., SIMD instructions, multi-socket and multi-core architectures), in order to accelerate both index construction and similarity search processing times. Moreover, it benefits from a careful design in the setup and coordination of the parallel workers and data structures, so that it maximizes its performance for in-memory operations. MESSI supports similarity search using both the Euclidean and Dynamic Time Warping (DTW) distances. Our experiments with synthetic and real datasets demonstrate that overall MESSI is up to 4x faster at index construction, and up to 11x faster at query answering than the state-of-the-art parallel approach. MESSI is the first to answer exact similarity search queries on 100GB datasets in ∼50msec (30-75msec across diverse datasets), which enables real-time, interactive data exploration on very large data series collections.


“…In the recent years, there has been much research on similarity searches and the subsequent data indexing [4]- [6]. In the context of time-series data indexing, an example query related to a similarity search can include finding past days in which the temperature recording is similar to today's pattern.…”

Section: Introduction

mentioning

confidence: 99%

“…In particular, we observe that clients not only focus on finding a trend (up or down) or a similar pattern in time-series data in a period of time, they also expect to obtain summarized information on such time series. The term 'summarized information' that we refer in this paper is not likely ''summarizations'' that proposed in [6], which are representations of time-series data segments. Our term means summarized outcomes extracted from a segment of data by relevant user-defined functions.…”

Section: Introduction

mentioning

confidence: 99%

Integration of IoT Streaming Data With Efficient Indexing and Storage Optimization

et al. 2020


In the era of IoT, the world of connected experiences is created by the convergence of multiple technologies including real-time analytics, machine learning, and commodity sensors and embedded systems. However, with the proliferation of these IoT technologies and devices, there are challenges in integrating, indexing and managing time-series data from multiple sources to optimize the storage of those data and/or retrieve the information from them in real-time. Many researchers have addressed the data integration issue through developing time-series data compression techniques; however, they focused mainly on the application of integer value compression to IoT data. Moreover, existing work does not focus on the issues of data and information retrieval without decompression. In this paper, we solve these issues by constructing an indexing framework within a lossless compression for floating point time-series data, where an index is based on the time-stamp from the compressed data that facilitates the search for data without full decompression. We conduct several sets of experiments and quantify the performance of our proposed approach. The experimental results, performed on IoT datasets, show a reduction in storage compared with existing compression techniques. The experimental study also demonstrates the capability of time-series data indexing and integration in real-time. INDEX TERMS Data integration, indexing, time-series data compression, floating point compression, decompression, IoT streaming data, window-based compression and integration.


“…However, similarity search in very large data series collections is notoriously challenging [70,49,50,50,18,17,13,14,2], due to the high dimensionality (length) of the data series. In order to address this problem, a significant amount of effort has been dedicated by the data management research community to data series indexing techniques [51,13,14], which lead to fast and scalable similarity search [16,56,29,4,62,24,66,11,12,71,72,68,69,53,55,54,9,31,32,33]. Predefined constraints.…”

mentioning

confidence: 99%

“…We note that the technique discussed above (despite its limitations) is indeed the current state of the art, and no other technique has been proposed since, even though during the same period of time we have witnessed lots of activity and a steady stream of papers on the single-length similarity search problem (e.g., [29,4,62,10,66,11,71,72,68,69,53,55,54,31,32,33]). This attests to the challenging nature of the problem we are tackling in this paper.…”

mentioning

confidence: 99%

Scalable data series subsequence matching with ULISSE

Linardi, Palpanas

2020

The VLDB Journal

Self Cite


Data series similarity search is an important operation and at the core of several analysis tasks and applications related to data series collections. Despite the fact that data series indexes enable fast similarity search, all existing indexes can only answer queries of a single length (fixed at index construction time), which is a severe limitation. In this work, we propose ULISSE, the first data series index structure designed for answering similarity search queries of variable length (within some range). Our contribution is twofold. First, we introduce a novel representation technique, which effectively and succinctly summarizes multiple sequences of different length. Based on the proposed index, we describe efficient algorithms for approximate and exact similarity search, combining disk-based index visits and in-memory sequential scans. Our approach supports non Z-normalized and Z-normalized sequences, and can be used with no changes with both Euclidean Distance and Dynamic Time Warping, for answering both k-NN and range queries. We experimentally evaluate our approach using several synthetic and real datasets. The results show that ULISSE is several times, and up to orders of magnitude more efficient in terms of both space and time cost, when compared to competing approaches. (Paper published in VLDBJ 2020) 1 Introduction Motivation. Data sequences are one of the most common data types, and they are present in almost every scientific and…

