Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads

Ding, Jialin; Nathan, Vikram; Alizadeh, Mohammad; Kraska, Tim

doi:10.48550/arxiv.2006.13282

Cited by 4 publications

(4 citation statements)

References 30 publications


“…Ma et al [42] used mixture density networks for AQP. Database indexing research recently has adopted neural networks to approximate cumulative density functions [9,10,30,49]. Query optimization and join ordering are also benefiting from neural networks [27,45].…”

Section: Related Work 61 Learned Database Systems (mentioning)

confidence: 99%


Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data

Kurmanji, Triantafillou. 2023. Proc. ACM Manag. Data.


Machine Learning (ML) is changing DBs as many DB components are being replaced by ML models. One open problem in this setting is how to update such ML models in the presence of data updates. We start this investigation focusing on data insertions (dominating updates in analytical DBs). We study how to update neural network (NN) models when new data follows a different distribution (a.k.a. it is "out-of-distribution" -- OOD), rendering previously-trained NNs inaccurate. A requirement in our problem setting is that learned DB components should ensure high accuracy for tasks on old and new data (e.g., for approximate query processing (AQP), cardinality estimation (CE), synthetic data generation (DG), etc.). This paper proposes a novel updatability framework (DDUp). DDUp can provide updatability for different learned DB system components, even based on different NNs, without the high costs to retrain the NNs from scratch. DDUp entails two components: First, a novel, efficient, and principled statistical-testing approach to detect OOD data. Second, a novel model updating approach, grounded on the principles of transfer learning with knowledge distillation, to update learned models efficiently, while still ensuring high accuracy. We develop and showcase DDUp's applicability for three different learned DB components, AQP, CE, and DG, each employing a different type of NN. Detailed experimental evaluation using real and benchmark datasets for AQP, CE, and DG detail DDUp's performance advantages.


“…Cardinality/selectivity estimation, has improved considerably leveraging ML [17,70,77,78,84]. Likewise for query optimization [27,44,45], indexes [9,10,30,49], cost estimation [63,83], workload forecasting [85], DB tuning [34,68,81], synthetic data generation [7,54,76], etc.…”

Section: Introduction (mentioning)

confidence: 99%

Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data

Kurmanji, Triantafillou. 2023. Proc. ACM Manag. Data.


“…This increase in performance is what Kraska et al hoped to achieve when they first introduced their work on Learned Index Structure (LIS) models [21]. Even though the concept of a LIS is still new, it has already led to a surge of inspiring results that leverage ideas from Machine Learning (ML), data structures, and database systems [7], [6], [31], [15], [1], [32], [5], [30], [23], [16], [11], [19], [8], [13], [27].…”

Section: Introduction (mentioning)

confidence: 99%

Testing the Robustness of Learned Index Structures

Bachfischer, Borovica-Gajić, Rubinstein. 2022. Preprint.


While early empirical evidence has supported the case for learned index structures as having favourable average-case performance, little is known about their worst-case performance. By contrast, classical structures are known to achieve optimal worst-case behaviour. This work evaluates the robustness of learned index structures in the presence of adversarial workloads. To simulate adversarial workloads, we carry out a data poisoning attack on linear regression models that manipulates the cumulative distribution function (CDF) on which the learned index model is trained. The attack deteriorates the fit of the underlying ML model by injecting a set of poisoning keys into the training dataset, which leads to an increase in the prediction error of the model and thus deteriorates the overall performance of the learned index structure. We assess the performance of various regression methods and the learned index implementations ALEX and PGM-Index. We show that learned index structures can suffer from a significant performance deterioration of up to 20% when evaluated on poisoned vs. non-poisoned datasets.


“…ML techniques enable automatic, fine-grained, and more accurate characterization of the problem space and benefit a variety of tasks in DBMS. Specifically, unsupervised ML techniques can model the data distribution for cardinality estimation (CardEst) [14,39,41,42,46] and indexing [6,7,18,27]; supervised ML models can replace the cost estimator (CostEst) [25,34,35] and execution scheduler [23,31]; and reinforcement learning methods solve decision making problems such as configuration tuning [1,20,44] and join order selection (JoinSel) [12,22,24,29,43].…”

Section: Introduction (mentioning)

confidence: 99%

A Unified Transferable Model for ML-Enhanced DBMS

Wu, Yan, Yu et al. 2021. Preprint.


Recently, the database management system (DBMS) community has witnessed the power of machine learning (ML) solutions for DBMS tasks. Despite their promising performance, these existing solutions can hardly be considered satisfactory. First, these ML-based methods in DBMS are not effective enough because they are optimized on each specific task, and cannot explore or understand the intrinsic connections between tasks. Second, the training process has serious limitations that hinder their practicality, because they need to retrain the entire model from scratch for a new DB. Moreover, for each retraining, they require an excessive amount of training data, which is very expensive to acquire and unavailable for a new DB. We propose to explore the transferabilities of the ML methods both across tasks and across DBs to tackle these fundamental drawbacks. In this paper, we propose a unified model MTMLF that uses a multi-task training procedure to capture the transferable knowledge across tasks and a pre-train fine-tune procedure to distill the transferable meta knowledge across DBs. We believe this paradigm is more suitable for cloud DB service, and has the potential to revolutionize the way how ML is used in DBMS. Furthermore, to demonstrate the predicting power and viability of MTMLF, we provide a concrete and very promising case study on query optimization tasks. Last but not least, we discuss several concrete research opportunities along this line of work.
