Online aggregation for large MapReduce jobs

Pansare, Niketan; Borkar, Vinayak; Jermaine, Chris; Condie, Tyson

doi:10.14778/3402707.3402748

Cited by 121 publications

(52 citation statements)

References 15 publications

(16 reference statements)

“…Users can stop the execution whenever the error bound meets their requirement. Some efforts have been focused on implementing online aggregation in MapReduce environments [7,14].…”

Section: Related Work

mentioning

confidence: 99%

Error-bounded sampling for analytics on big sparse data

2014

Aggregation queries are at the core of business intelligence and data analytics. In the big data era, many scalable shared-nothing systems have been developed to process aggregation queries over massive amounts of data. Microsoft's SCOPE is a well-known instance in this category. Nevertheless, aggregation queries are still expensive, because query processing needs to consume the entire data set, which is often hundreds of terabytes. Data sampling is a technique that samples a small portion of the data to process and returns an approximate result with an error bound, thereby reducing the query's execution time. While similar problems were studied in the database literature, we encountered new challenges that rule out most prior efforts: (1) error bounds are dictated by end users and cannot be compromised, (2) data is sparse, meaning data has a limited population but a wide range. For such cases, conventional uniform sampling often yields high sampling rates and thus delivers limited or no performance gains. In this paper, we propose error-bounded stratified sampling to reduce sample size. The technique relies on the insight that we may only reduce the sampling rate with knowledge of the data distribution. The technique has been implemented in Microsoft's internal search query platform. Results show that the proposed approach can reduce sample size by up to 99% compared with uniform sampling, and its performance is robust against data volume and other key performance metrics.

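“…Reference [48] studied kernel density estimation, an important foundational problem in data analysis, and proposed two classes of algorithms, randomized and deterministic, that outperform existing algorithms by several orders of magnitude. Reference [43] proposed an online aggregation algorithm for big data based on Map-Reduce. Reference [121], also based on Map-Reduce, studied set-relationship analysis over streaming data and proposed high-performance parallel algorithms built on data partitioning, redundant storage, and computational load balancing.…”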

unclassified

Research progress in the complexity theory and algorithms of big-data computation

2016

Sci. Sin.-Inf.

“…Answering this query requires accessing all location and air pollution measurements in the time period of interest, which can be substantial for long periods. To solve this problem, researchers have proposed approximate query processing algorithms (JERMAINE et al, 2007; AGARWAL et al, 2013; OOI; TAN, 2010; BABCOCK; DATAR; MOTWANI, 2004; PANSARE et al, 2011; POTTI; PATEL, 2015; LAZARIDIS;) that approximate the query result by looking at a subset of the data.…”

Section: Publications

mentioning

confidence: 99%

“…On the other hand, if the user demands a lower error, the algorithm will be able to satisfy the request by visiting lower levels of the segment trees (which exact nodes will be visited also depends on the query and the interplay of the time series in it). Leveraging the trees, PlatoDB can even provide users with continuously improving approximate answers and error guarantees, allowing them to stop the computation at any time, similar to works in online aggregation (WANG, 1997; CONDIE et al, 2010; PANSARE et al, 2011).…”

Section: A2 System Architecture

mentioning

confidence: 99%

“…Approximate query answering with probabilistic error guarantees. Most of the existing work on approximate query processing has focused on using sampling to compute approximate query answers by appropriately evaluating the queries on small samples of the data (JERMAINE et al, 2007; AGARWAL et al, 2013; OOI; TAN, 2010; BABCOCK; DATAR; MOTWANI, 2004; PANSARE et al, 2011). Such approaches typically leverage statistical inequalities and the central limit theorem to compute the confidence interval or variance of the computed approximate answer.…”

Section: A9 Related Work

mentioning

confidence: 99%
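
The statement above describes sampling-based estimators that use the central limit theorem to attach a confidence interval to an approximate answer, and earlier statements note that users can stop once the error bound is acceptable. A minimal sketch of that general idea follows, assuming a simple AVG query over an in-memory list scanned in random order. The function name `online_mean` and its parameters are hypothetical, and this is a textbook estimator, not the method of Pansare et al. or of any citing work.

```python
import math
import random


def online_mean(values, rel_error=0.01, z=1.96, min_samples=30, seed=0):
    """Estimate the mean of `values`, stopping early when the CLT-based
    confidence half-width drops below rel_error * |running mean|.

    Returns (estimate, half_width, rows_read). All names are illustrative.
    """
    order = list(values)
    if not order:
        raise ValueError("values must be non-empty")
    # Random scan order, so that any prefix is a uniform random sample.
    random.Random(seed).shuffle(order)

    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and sum of squares
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_samples:
            variance = m2 / (n - 1)
            half_width = z * math.sqrt(variance / n)  # z = 1.96 for ~95%
            if half_width <= rel_error * abs(mean):
                return mean, half_width, n  # good enough: stop early

    # Every row was read, so the answer is exact.
    return mean, 0.0, n


if __name__ == "__main__":
    data = [float(i % 200) for i in range(100_000)]
    estimate, half_width, rows = online_mean(data, rel_error=0.05)
    print(f"mean ~ {estimate:.2f} +/- {half_width:.2f} after {rows} rows")
```

The early-stopping check is what makes the estimate "online": the caller trades accuracy for time by choosing `rel_error`. Checking the interval repeatedly while scanning weakens the nominal 95% guarantee somewhat, which is one reason production systems use more careful statistical models.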

Data Warehouses na era do Big Data: processamento eficiente de Junções Estrela no Hadoop

Brito

The era of Big Data is here: the combination of unprecedented amounts of data collected every day with the promotion of open source solutions for massively parallel processing has shifted the industry in the direction of data-driven solutions. From recommendation systems that help you find your next significant one to the dawn of self-driving cars, Cloud Computing has enabled companies of all sizes and areas to achieve their full potential with minimal overhead. In particular, the use of these technologies for Data Warehousing applications has decreased costs greatly and provided remarkable scalability, empowering business-oriented applications such as Online Analytical Processing (OLAP). One of the most essential primitives in Data Warehouses is the Star Join, i.e. the join of a central table with satellite dimensions. As the volume of the database scales, Star Joins become impractical and may seriously limit applications. In this thesis, we proposed specialized solutions to optimize the processing of Star Joins. To achieve this, we used the Hadoop software family on a cluster of 21 nodes. We showed that the primary bottleneck in the computation of Star Joins on Hadoop lies in the excessive disk spill and overhead due to network communication. To mitigate these negative effects, we proposed two solutions based on a combination of the Spark framework with either Bloom filters or the Broadcast technique. This reduced the computation time by at least 38%. Furthermore, we showed that the use of full scan may significantly hinder the performance of queries with low selectivity. Thus, we proposed a distributed Bitmap Join Index that can be processed as a secondary index with loose-binding and can be used with random access in the Hadoop Distributed File System (HDFS). We also implemented three versions (one in MapReduce and two in Spark) of our processing algorithm that uses the distributed index, which reduced the total computation time by up to 88% for Star Joins with low selectivity from the Star Schema Benchmark (SSB). Because, ideally, the system should be able to perform both random access and full scan, our solution was designed to rely on a two-layer architecture that is framework-agnostic and enables the use of a query optimizer to select which approaches should be used as a function of the query. Due to the ubiquity of joins as primitive queries, our solutions are likely to fit a broad range of applications. Our contributions not only leverage the strengths of massively parallel frameworks but also exploit more efficient access methods to provide scalable and robust solutions to Star Joins with a significant drop in total computation time.
