Continuous sampling for online aggregation over multiple queries

Wu, Sai; Ooi, Beng Chin; Tan, Kian-Lee

doi:10.1145/1807167.1807238

Cited by 62 publications

(46 citation statements)

References 24 publications


“…Standard statistical formulas can help us get unbiased estimators and estimate the confidence interval. A lot of previous work [13][14][15][16] have made great contributions on this problem.…”

Section: Sampling (mentioning)

confidence: 99%
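The statement above refers to standard formulas for unbiased estimators and confidence intervals. As a rough illustration only, and not the method of the cited paper, the following sketch refines an AVG estimate as sampled values arrive and reports a normal-approximation confidence interval. The function name `running_mean_ci`, the synthetic population, and the 95% level are assumptions made for this example; it also ignores the finite-population correction.

```python
import math
import random
from statistics import NormalDist


def running_mean_ci(values, confidence=0.95):
    """Yield (n, estimate, half_width) after each sampled value.

    Hypothetical helper: uses Welford's update for the running mean and
    variance, plus a CLT-based interval (assumes a uniform random sample).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            # Standard error of the mean from the unbiased sample variance.
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err
        else:
            # One observation gives no variance estimate yet.
            yield n, mean, float("inf")


# Synthetic data, assumed purely for illustration.
random.seed(0)
population = [random.gauss(100.0, 15.0) for _ in range(100_000)]
sample = random.sample(population, 2_000)
snapshots = list(running_mean_ci(sample))

for n, estimate, half_width in snapshots[99::500]:
    print(f"n={n:5d}  AVG ~ {estimate:.2f} +/- {half_width:.2f}")
```

The interval narrows roughly in proportion to 1/sqrt(n), which is the accuracy-versus-time trade-off that online aggregation exposes to the user.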


Approximate Calculation of Window Aggregate Functions via Global Random Sample

Song, Liu, et al. 2018. Data Sci. Eng.


Window functions have been part of the SQL standard since 2003 and have been studied extensively during the past decade. They are widely used in data analysis, and almost all current mainstream commercial databases support them. However, dataset sizes have grown steeply in recent years, and existing window function implementations are not efficient enough. Recently, some sampling-based algorithms (e.g., online aggregation) have been proposed to deal with large and complex data in relational databases, offering a flexible trade-off between accuracy and efficiency. However, few sampling techniques have been considered for window functions in databases. In this paper, we extend our previous work (Song et al. in Asia-Pacific web and web-age information management joint conference on web and big data, Springer, pp 229-244, 2017) and propose two new algorithms: a range-based global sampling algorithm and a row-labeled sampling algorithm. The proposed algorithms use global sampling rather than local sampling and are more efficient than other existing algorithms. We find that the proposed algorithms outperform the baseline method on the TPC-H benchmark dataset.



“…Since then, research on online aggregation has been actively pursued. Xu et al [14] studied online aggregation with group by clause and Wu et al [16] proposed a continuous sampling algorithm for online aggregation over multiple queries. Qin and Rusu [27] extended online aggregate to distributed and parallel environments.…”

Section: Related Work (mentioning)

confidence: 99%


“…Another extension to MapReduce has been to address continuous processing such as stream processing [Stephens 1997;Golab and Özsu 2010] or online aggregation [Hellerstein et al 1997;Wu et al 2010b]. Recall that a sort-merge process is accomplished by the mapper and reducer modules.…”

Section: Streams and Continuous Query Processing (mentioning)

confidence: 99%

Distributed data management using MapReduce

et al. 2014

Self Cite


MapReduce is a framework for processing and managing large-scale data sets in a distributed cluster, which has been used for applications such as generating search indexes, document clustering, access log analysis, and various other forms of data analytics. MapReduce adopts a flexible computation model with a simple interface consisting of map and reduce functions whose implementations can be customized by application developers. Since its introduction, a substantial amount of research effort has been directed towards making it more usable and efficient for supporting database-centric operations. In this paper we aim to provide a comprehensive review of a wide range of proposals and systems that focus fundamentally on the support of distributed data management and processing using the MapReduce framework.


“…There are many different methods to sample a data warehouse I [1,12,13,17,19] and we consider two specific techniques:…”

Section: Sampling a Data Warehouse (mentioning)

confidence: 99%

“…This subject has become important in the context of streaming data [4,15,19]. In our approach, we consider the L1 distance between distributions: two answers are ε-close if the L1 distance between two distributions is less than ε.…”

Section: Introduction (mentioning)

confidence: 99%
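The ε-closeness criterion quoted above can be made concrete with a small sketch. It assumes that a query answer is represented as a mapping from group key to relative share (a distribution summing to 1); the helper names `normalize`, `l1_distance`, and `is_eps_close`, and the sample counts, are hypothetical and chosen only for this example.

```python
def normalize(counts):
    """Turn raw per-group measures into a distribution that sums to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def l1_distance(p, q):
    """Sum of absolute differences over the union of group keys."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def is_eps_close(p, q, eps):
    """Two answers are eps-close if their L1 distance is below eps."""
    return l1_distance(p, q) < eps


# Illustrative numbers only: an exact answer and a sampled approximation.
exact = normalize({"Paris": 50, "London": 30, "Rome": 20})
approx = normalize({"Paris": 48, "London": 33, "Rome": 19})

print(round(l1_distance(exact, approx), 6))
print(is_eps_close(exact, approx, eps=0.1))
```

A group missing from one answer contributes its full share to the distance, which is why the sum runs over the union of keys rather than the intersection.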

Approximate answers to OLAP queries on streaming data warehouses

Rougemont, Cao. 2012. Proceedings of the Fifteenth International Workshop on Data Warehousing and OLAP


We study streaming data for a data warehouse, which combines different sources. We consider the relative answers to OLAP queries on a schema, as distributions with the L1 distance and approximate the answers without storing the entire data warehouse. We first study how to sample each source and combine the samples to approximate any OLAP query. We then consider a streaming context, where a data warehouse is built by streams of different sources. We first show a lower bound on the size of the memory necessary to approximate queries and then consider a statistical hypothesis where some attributes determine fixed distributions of the measure. We use the sampling methods to learn the statistical model and approximate OLAP queries. In this case, we approximate OLAP queries with a finite memory. We apply the method to a dataset which simulates the data of sensors, which provide weather parameters over time and locations from different sources.

