The analytical bootstrap

Zeng, Kai; Shi, Guangming; Mozafari, Barzan; Zaniolo, Carlo

doi:10.1145/2588555.2588579

Cited by 74 publications

(4 citation statements)

References 40 publications

“…For example, the confidence interval [3.5, 5.5] with the confidence level 95% means that we have 95% confidence to ensure that the accurate result will fall into the interval [3.5, 5.5]. In our progressive execution model, the expected performance is that the width of Currently, there are three widely-used methods in AQP system to do error estimation: closed-form estimates based on either the central limit theorem (CLT) [26], large deviation inequalities such as Hoeffding bounds [12], and the bootstrap [8,30]. As discussed before, the Zip-F law of natural languages motivated us to use bootstrap techniques in our method.…”

Section: Quantifying Results Error (mentioning)

confidence: 99%
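
The three error-estimation approaches named in the statement above (CLT-based closed forms, Hoeffding bounds, and the bootstrap) can be contrasted on a toy sample mean. The following is a minimal sketch under stated assumptions: values are bounded in a known range, the bootstrap shown is a plain percentile bootstrap (not the analytical bootstrap of Zeng et al. and not the variant proposed by the citing paper), and all function names are hypothetical.

```python
import math
import random
import statistics

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def clt_interval(sample):
    """Closed-form CLT interval: mean +/- z * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    half = Z_95 * statistics.stdev(sample) / math.sqrt(n)
    return mean - half, mean + half


def hoeffding_interval(sample, lo, hi, delta=0.05):
    """Hoeffding interval; assumes every value lies in [lo, hi]."""
    n = len(sample)
    mean = statistics.fmean(sample)
    half = (hi - lo) * math.sqrt(math.log(2 / delta) / (2 * n))
    return mean - half, mean + half


def bootstrap_interval(sample, resamples=1000, delta=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, read off quantiles."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(resamples)
    )
    lower = means[int((delta / 2) * resamples)]
    upper = means[int((1 - delta / 2) * resamples) - 1]
    return lower, upper


# Illustrative data only: 500 uniform draws in [0, 1].
data_rng = random.Random(42)
data = [data_rng.random() for _ in range(500)]

print("CLT      ", clt_interval(data))
print("Hoeffding", hoeffding_interval(data, 0.0, 1.0))
print("Bootstrap", bootstrap_interval(data))
```

On data like this the Hoeffding interval comes out noticeably wider than the other two, since it uses only the value range and not the observed variance. The bootstrap needs no distributional assumption but recomputes the statistic once per resample, which is the cost the citing statements refer to.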

“…These techniques either compute an error bound much wider than the real which lost guidance to users or require data to follow the normal distribution while it's not suitable for natural languages. Another estimation technique, bootstrap [23,30], can be applied to arbitrary queries. However, before bootstrap techniques have poor performance to apply in our progressive execution model due to lots of duplicate computation.…”

Section: Error Estimation (mentioning)

confidence: 99%

“…Unfortunately, these techniques either compute an error bound much wider than the real which lost guidance to users or require data to follow the normal distribution while the distribution of terms frequency often obeys the Zipf law [6]. This has motivated the use of resampling methods like bootstrap [30], which requires no such normal distribution and can be applied to arbitrary queries. However, traditional bootstrap and its variant, variational subsampling technique proposed by VerdictDB [23] remain high complexity in our progressive execution model due to lots of duplicate computation.…”

Section: Introduction (mentioning)

confidence: 99%

Parrot: A Progressive Analysis System on Large Text Collections

Zhang et al. 2020, Data Sci. Eng.

The size of textual data continues to grow along with the need for timely and cost-effective analysis, while the growth of computation power cannot keep up with the growth of data. The delays when processing huge textual data can negatively impact user activity and insight. This calls for a paradigm shift from blocking fashion to progressive processing. In this paper, we propose a sample-based progressive processing model that focuses on term frequency calculation on text. The model is based on an incremental execution engine and will calculate a series of approximate results for a single query in a progressive way to provide a smooth trade-off between accuracy and latency. As a part, we proposed a new variant of the bootstrap technique to quantify result error progressively. We implemented this method in our system called Parrot on top of Apache Spark and used real-world data to test its performance. Experiments demonstrate that our method is 2.4×–19.7× faster to get a result within 1% error while the confidence interval always covers the accurate results very well.

“…The analytical bootstrap method [25], reduces the overhead of the bootstrap error estimation, removing the need for re-sampling.…”

Section: Analytical (mentioning)

confidence: 99%

Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping Study

2019

When exploring big amounts of data without a clear target, providing an interactive experience becomes really difficult, since this tentative inspection usually defeats any early decision on data structures or indexing strategies. This is also true in the physics domain, specifically in high-energy physics, where the huge volume of data generated by the detectors are normally explored via C++ code using batch processing, which introduces a considerable latency. An interactive tool, when integrated into the existing data management systems, can add a great value to the usability of these platforms. Here, we intend to review the current state-of-the-art of interactive data exploration, aiming at satisfying three requirements: access to raw data files, stored in a distributed environment, and with a reasonably low latency. This paper follows the guidelines for systematic mapping studies, which is well suited for gathering and classifying available studies. We summarize the results after classifying the 242 papers that passed our inclusion criteria. While there are many proposed solutions that tackle the problem in different manners, there is little evidence available about their implementation in practice. Almost all of the solutions found by this paper cover a subset of our requirements, with only one partially satisfying the three. The solutions for data exploration abound. It is an active research area and, considering the continuous growth of data volume and variety, is only to become harder. There is a niche for research on a solution that covers our requirements, and the required building blocks are there.

Index terms: Big data applications, data analysis, data engineering, data exploration, database systems, interactive systems, systematic mapping study.

Skew‐aware online aggregation over joins through guided sampling

Wang, Jin, et al. 2018, Concurrency and Computation

Online aggregation is a query processing technique that returns approximate answers with error guarantees (in the form of confidence intervals) continuously during the query execution process. This approach offers users a suitable tradeoff between query efficiency and accuracy. The key issue of online aggregation is how to ensure a random sample collection's efficiency and effectiveness. However, the often-used "blind" sampling method does not adequately consider dataset statistics and other useful information, leading to inefficient sampling and poor sample quality. This becomes a glaring performance issue for skewed data distribution over joins. To alleviate this problem, we utilize dataset statistics to propose a new "guided" sampling approach, which consists of a logic-partition-based weighted Gaussian sampling method tailored for the skewed join key, as well as a two-level sample allocation method that applies to the skewed measured value. Extensive experiments using the TPC-H benchmark for skewed data distribution demonstrate our solution's superior performance.
