ML-AQP: Query-Driven Approximate Query Processing based on Machine Learning

Savva, Fotis; Anagnostopoulos, Christos; Triantafillou, Peter

doi:10.48550/arxiv.2003.06613

Cited by 3 publications

(4 citation statements)

References 35 publications

“…QuickSel [23] is also an earlier method of using neural network to solve the cardinality estimation problem by fitting the data distribution. ML-AQP [24] leverages the query workload-driven idea to define the AQP problem as a supervised learning task. It uses a regression model to mine the relationship between the mapping of queries to aggregate function values in the logs.…”

Section: Related Work (mentioning)

confidence: 99%
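
The statement above frames query-driven AQP as supervised learning over a query log. A minimal sketch of that idea follows, assuming range queries encoded as `(low, high)` vectors and a k-nearest-neighbour regressor in place of the regression model used by ML-AQP. The helper names, the toy data, and the choice of regressor are illustrative assumptions, not the authors' implementation.

```python
from math import dist

def count_in_range(data, low, high):
    """Exact COUNT(*) WHERE low <= x < high; stands in for the real query engine."""
    return sum(1 for x in data if low <= x < high)

def build_query_log(data, queries):
    """Pair each past query vector with its exact aggregate result."""
    return [((low, high), count_in_range(data, low, high)) for low, high in queries]

def predict(log, query, k=3):
    """k-nearest-neighbour regression: average the results of the k closest logged queries."""
    nearest = sorted(log, key=lambda entry: dist(entry[0], query))[:k]
    return sum(result for _, result in nearest) / len(nearest)

# Toy column and a hypothetical workload of previously executed range queries.
data = list(range(1000))
past_queries = [(low, low + width) for low in range(0, 900, 50) for width in (50, 100)]
log = build_query_log(data, past_queries)

# Answer an unseen query from the log alone, without scanning `data`.
estimate = predict(log, (210, 290))
exact = count_in_range(data, 210, 290)
```

The data is scanned only while the log is built; at prediction time the estimate depends solely on logged queries and their results, which is the property the query-driven formulation relies on.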

Cardinality estimation based on QDSPN for embedded databases under dynamic workload

Ding, Su, Shen et al. 2024

Preprint


Cardinality estimation has been a pivotal and enduring research focus within database query optimization. While significant advancements have been made in estimating cardinalities for both individual tables and complex multi-table joins, there remains a notable gap in research pertaining to embedded database scenarios. Embedded databases are typically characterized by limited resources and a preponderance of dense, short-term hotspot queries. As a result, cardinality estimation within the constraints of embedded databases poses additional complexities and challenges. In this paper, we introduce a novel Query-driven Sum-Product Network (QDSPN), which leverages the capabilities of sum-product networks (SPNs) to learn from historical data and adapt to dynamic workload variations. This approach effectively mitigates the inherent challenges of SPNs, such as false cluster collisions and independence assumption errors, particularly under conditions of strongly correlated data. Furthermore, we propose a two-stage query clustering framework tailored for dynamic workload environments. This framework serves to guide the structural configuration of the sum-product network, enhancing its adaptability and efficiency. We conduct extensive experiments to validate the performance of QDSPN under dynamic workloads. The experimental results demonstrate the evident advantages of the proposed QDSPN, and highlight its potential for widespread adoption in embedded database systems.


“…However, majority of AQP systems use stratified sampling based on prior knowledge (which might not be always available) of the data distributions [17], [18], [19]. Specifically, it had been demonstrated that uniform random samples are less effective for answering "Group By" which are important when conducting data exploratory analysis while biased sampling show better efficiency for these sort of tasks [20].…”

Section: Related Work (mentioning)

confidence: 99%
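
The contrast between uniform and stratified sampling for "Group By" queries can be sketched as follows. This is a toy illustration under assumed data (one large group and one five-row group); `stratified_sample` is a hypothetical helper, not code from any of the cited systems.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, per_group, rng):
    """Draw up to `per_group` rows from every group so that no group is missed."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        k = min(per_group, len(members))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [("common", i) for i in range(10_000)] + [("rare", i) for i in range(5)]

# A uniform sample is dominated by the large group and may contain no "rare" rows.
uniform = rng.sample(rows, 100)
# A stratified sample covers every group, but it needs the grouping key in advance.
stratified = stratified_sample(rows, key=lambda r: r[0], per_group=50, rng=rng)

groups_in_uniform = {group for group, _ in uniform}
groups_in_stratified = {group for group, _ in stratified}
```

Because rows are drawn at different rates per group, aggregates computed over a stratified sample have to be reweighted by each group's sampling fraction; that step is omitted here.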

“…As much as this method is similar to our proposed method, one significant difference rely in the way the DL model is being used: while in this method the model is used to generate samples distributed tightly similar to the dataset distributions and then execute the queries on these samples, our method rely on the intrinsic structure of the LSTM network to both learn the dataset distributions and answer the approximated result. similar to our approach, this work utilized ML models to approximate aggregated SQL queries [19]. Specifically, gradient Boosting Machines (GBM), XGBoost and LightGBM were trained to predict the aggregated queries' result.…”

Section: Related Work (mentioning)

confidence: 99%

Approximating Aggregated SQL Queries with LSTM Networks

Regev, Rokach, Shabtai 2021

2021 International Joint Conference on Neural Networks (IJCNN)


Despite continuous investments in data technologies, the latency of querying data still poses a significant challenge. Modern analytic solutions require near real-time responsiveness both to make them interactive and to support automated processing. Current technologies (Hadoop, Spark, Dataflow) scan the dataset to execute queries. They focus on providing scalable data storage to maximize task execution speed. We argue that these solutions fail to offer an adequate level of interactivity since they depend on continual access to data. In this paper we present a method for query approximation, also known as approximate query processing (AQP), that reduces the need to scan data during inference (query calculation), thus enabling a rapid query processing tool. We use an LSTM network to learn the relationship between queries and their results, and to provide a rapid inference layer for predicting query results. Our method (referred to as "Hunch") produces a lightweight LSTM network which provides a high query throughput. We evaluated our method using 12 datasets. The results show that our method predicted queries' results with a normalized root mean squared error (NRMSE) ranging from approximately 1% to 4%. Moreover, our method was able to predict up to 120,000 queries in a second (streamed together), and with a single query latency of no more than 2ms.
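
For readers unfamiliar with the NRMSE metric quoted in the abstract, a minimal sketch of one common definition follows. It assumes normalization by the range of the true values; the abstract does not say which normalization the authors used, so treat that choice and the example numbers as assumptions.

```python
from math import sqrt

def nrmse(y_true, y_pred):
    """Root mean squared error divided by the range of the true values (assumed normalization)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("NRMSE is undefined when all true values are equal")
    return sqrt(mse) / value_range

# Hypothetical exact query results and the corresponding model predictions.
exact_results = [100.0, 200.0, 300.0, 400.0]
predicted_results = [110.0, 190.0, 310.0, 390.0]
score = nrmse(exact_results, predicted_results)
```

Other normalizations, such as dividing by the mean of the true values, are also in use and give different numbers for the same predictions.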


“…For instance, the DBEst Query processing engine [6] trains models, notably regression models and density estimators, that provide accurate, efficient, and cost-effective responses to different types of aggregate queries. Learning-based AQP (LAQP) [7] and ML-AQP [8] methods build machine learning models based on historically executed queries. The former builds an error model to predict each incoming query's sampling-based estimation error, whereas the latter trains models that learn patterns to predict future query results with a bound error by applying prediction intervals constructed using Quantile Regression models.…”

Section: Introduction (mentioning)

confidence: 99%
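
The error bound described in this statement rests on quantile regression, which minimizes the pinball loss. The sketch below is an intercept-only simplification: it fits two constant quantiles to the residuals of a point estimator and turns them into a prediction interval. A full quantile regression model would condition the quantiles on the query features; the function names and residual values are assumptions made for illustration.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average pinball (quantile) loss for quantile level tau in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y_true)

def fit_constant_quantile(values, tau):
    """Intercept-only quantile regression: pick the observed value with the lowest pinball loss."""
    return min(values, key=lambda c: pinball_loss(values, [c] * len(values), tau))

def prediction_interval(point_estimate, residuals, lower_tau=0.05, upper_tau=0.95):
    """Shift the point estimate by the fitted lower and upper residual quantiles."""
    low = fit_constant_quantile(residuals, lower_tau)
    high = fit_constant_quantile(residuals, upper_tau)
    return point_estimate + low, point_estimate + high

# Hypothetical residuals (exact minus predicted) collected from held-out logged queries.
residuals = [-12.0, -7.0, -3.0, -1.0, 0.0, 1.0, 2.0, 4.0, 8.0, 15.0]
interval = prediction_interval(100.0, residuals)
```

With these ten residuals the fitted 5% and 95% quantiles are the smallest and largest values, so the interval around an estimate of 100.0 is (88.0, 115.0).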

GAN-Based Tabular Data Generator for Constructing Synopsis in Approximate Query Processing: Challenges and Solutions

Fallahian, Dorodchi, Kreth 2024

MAKE


In data-driven systems, data exploration is imperative for making real-time decisions. However, big data are stored in massive databases that are difficult to retrieve. Approximate Query Processing (AQP) is a technique for providing approximate answers to aggregate queries based on a summary of the data (synopsis) that closely replicates the behavior of the actual data; this can be useful when an approximate answer to queries is acceptable in a fraction of the real execution time. This study explores the novel utilization of a Generative Adversarial Network (GAN) for the generation of tabular data that can be employed in AQP for synopsis construction. We thoroughly investigate the unique challenges posed by the synopsis construction process, including maintaining data distribution characteristics, handling bounded continuous and categorical data, and preserving semantic relationships, and we then introduce the advancement of tabular GAN architectures that overcome these challenges. Furthermore, we propose and validate a suite of statistical metrics tailored for assessing the reliability of GAN-generated synopses. Our findings demonstrate that advanced GAN variations exhibit a promising capacity to generate high-fidelity synopses, potentially transforming the efficiency and effectiveness of AQP in data-driven systems.
