2009
DOI: 10.14778/1687627.1687675

Turbo-charging estimate convergence in DBO

Abstract: DBO is a database system that utilizes randomized algorithms to give statistically meaningful estimates for the final answer to a multi-table, disk-based query from start to finish during query execution. However, DBO's "time 'til utility" (or "TTU"; that is, the time until DBO can give a useful estimate) can be overly large, particularly in the case that many database tables are joined in a query, or in the case that a join query includes a very selective predicate on one or more of the tables, or when the da…
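To make the abstract's idea of a statistically meaningful running estimate concrete, the sketch below shows the general online-aggregation pattern: consume rows in random order, maintain a plug-in estimate of a SUM, and report a CLT-based confidence interval that tightens as more rows are seen. This is only an illustration under a simple-random-sampling assumption on a single table; DBO's actual estimators operate over multi-table, disk-based joins and are considerably more involved. The function and data here are hypothetical.

```python
import math
import random

def running_sum_estimate(values, N, z=1.96):
    """Yield (estimate, half_width) after each value, assuming the values
    arrive in uniformly random order from a table of N rows (illustrative only)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
        mean = total / n
        estimate = N * mean                       # plug-in estimate of the SUM
        if n > 1:
            var = max((total_sq - n * mean * mean) / (n - 1), 0.0)
            half_width = z * N * math.sqrt(var / n)   # CLT-style error bound
        else:
            half_width = float("inf")
        yield estimate, half_width

# The estimate converges toward the true SUM as more rows are consumed.
data = [random.random() for _ in range(10_000)]
random.shuffle(data)
for i, (est, hw) in enumerate(running_sum_estimate(data, len(data)), start=1):
    if i % 2_000 == 0:
        print(f"after {i} rows: {est:.1f} +/- {hw:.1f}")
```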

Cited by 27 publications (34 citation statements); references 17 publications (22 reference statements).
“…This is because Verdict reaches a target error bound much earlier by combining its model with the raw answer of the AQP engine. [6,19,24,36,45,66,87]: Instead of continuously refining approximate answers and reporting them to the user, these engines simply take a time bound from the user, and then they predict the largest sample size that they can process within the requested time bound; thus, they minimize error bounds within the allotted time. For these engines, Verdict simply replaces the user's original time bound t1 with a slightly smaller value t1 − ε before passing it down to the AQP engine, where ε is the time needed by Verdict for inferring the improved answer and improved error.…”
Section: Deployment Scenarios
confidence: 99%
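The time-bound adjustment described in the quoted passage can be sketched in a few lines. This is a hypothetical illustration of the t1 − ε hand-off, not Verdict's real API; run_aqp_query and infer_improved_answer stand in for the underlying time-bounded AQP engine and Verdict's inference step.

```python
def answer_with_model(query, t1, epsilon, run_aqp_query, infer_improved_answer):
    """Hypothetical sketch: reserve epsilon seconds for model-based inference
    and give the time-bounded AQP engine the remaining budget."""
    # The engine picks the largest sample it can process within t1 - epsilon.
    raw_answer, raw_error = run_aqp_query(query, time_bound=t1 - epsilon)
    # The reserved epsilon is spent combining the raw answer with the model
    # to produce an improved answer and a (typically tighter) error bound.
    return infer_improved_answer(query, raw_answer, raw_error)
```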
“…The database online aggregation literature has its origins in the seminal paper by Hellerstein et al [21]. We can broadly categorize this body of work into system design [30,7,13,2], online join algorithms [20,8,34], and methods to derive confidence bounds [19]. All of this work is targeted at single-node centralized environments.…”
Section: Related Work
confidence: 99%
“…In particular, sampling has served as one of the most common and generic approaches to approximation of analytical queries [7,8,13,14,15,21,23,30,31]. The simplest form of sampling is simple random sampling with plug-in estimation.…”
Section: Approximate Query Processing (AQP)
confidence: 99%
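As a rough illustration of the "simple random sampling with plug-in estimation" mentioned above: evaluate the aggregate on a uniform sample and scale by the inverse sampling fraction. The sketch below uses a Bernoulli (per-row coin flip) sample as a stand-in for simple random sampling; the table, predicate, and function names are made up for the example.

```python
import random

def plug_in_sum(table, predicate, value_of, sampling_fraction=0.01):
    """Estimate SELECT SUM(value) FROM table WHERE predicate via sampling
    (illustrative sketch, not any particular system's implementation)."""
    sample = [row for row in table if random.random() < sampling_fraction]
    sample_sum = sum(value_of(row) for row in sample if predicate(row))
    # Plug-in step: scale the sample aggregate by 1 / sampling fraction.
    return sample_sum / sampling_fraction
```

Note that a very selective predicate leaves few qualifying rows in the sample, which is one reason (echoed in the abstract's TTU discussion) that such estimates can remain noisy for a long time.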
“…Nearly three decades ago, Olken and Rotem [27] introduced random sampling in relational databases as a means to return approximate answers and reduce query response times. A large body of work has subsequently proposed different sampling techniques [7,8,13,14,15,21,23,30,31,36]. All of this…”
Section: Introduction
confidence: 99%