Efficient processing of data warehousing queries in a split execution environment

Bajda-Pawlikowski, Kamil; Abadi, Daniel J.; Silberschatz, Abraham; Paulson, Erik K.

doi:10.1145/1989323.1989447

Cited by 68 publications

(38 citation statements)

References 18 publications

“…Greenplum and Aster Data have added the ability to execute MapReduce-style functions over data stored in these systems. HadoopDB and split execution [6,10] explore on the architectural level exploiting hybrid MapReduce and relational database systems. Dremel, another project from Google [26], is worth mentioning as an example of a new generation of database systems that are massively distributed and run interactive queries on very large data sets.…”

Section: Related Work (mentioning)

confidence: 99%

Shark: Fast Data Analysis Using Coarse-grained Distributed Memory

Clifford Engle

2013

Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data. It scales to thousands of nodes in a fault-tolerant manner. Shark can answer queries 40X faster than Apache Hive and run machine learning programs 25X faster than MapReduce programs in Apache Hadoop on large datasets. This is a complete overview of the development of Shark, including design decisions, performance details, and comparison with existing data warehousing solutions. It demonstrates some of Shark's distinguishing features including its in-memory columnar caching and its unified machine learning interface.

“…Furthermore, MapReduce is accompanied by a plethora of free tools as well as having cluster availability and support. Hive [11], Pig [37], Scope [20], and HadoopDB [10,38] are projects that provide SQL abstractions on top of MapReduce platform to familiarize the programmers with complex queries. SQL/MapReduce [39] and Greenplum [21] …”

Section: Related Work (mentioning)

confidence: 99%

Improving the performance of Hadoop Hive by sharing scan and computation tasks

et al. 2014

MapReduce is a popular programming model for executing time-consuming analytical queries as a batch of tasks on large scale data clusters. In environments where multiple queries with similar selection predicates, common tables, and join tasks arrive simultaneously, many opportunities can arise for sharing scan and/or join computation tasks. Executing common tasks only once can remarkably reduce the total execution time of a batch of queries. In this study, we propose a Multiple Query Optimization framework, SharedHive, to improve the overall performance of Hadoop Hive, an open source SQL-based data warehouse using MapReduce. SharedHive transforms a set of correlated HiveQL queries into a new set of insert queries that will produce all of the required outputs within a shorter execution time. It is experimentally shown that SharedHive achieves significant reductions in total execution times of TPC-H queries.

“…For instance, HadoopDB [1] (which forms the basis of its commercial version, Hadapt) uses relational databases to perform MapReduce tasks. Microsoft PolyBase [4] improves the scalability of SQL Server through "split query processing" [2], which transforms queries into MapReduce jobs. Sailfish [20] accelerates MapReduce by batching disk I/Os.…”

Section: Related Work (mentioning)

confidence: 99%

Design and implementation of a real-time interactive analytics system for large spatio-temporal data

et al. 2014

In real-time interactive data analytics, the user expects to receive the results of each query within a short time period such as seconds. This is especially challenging when the data is big (e.g., on the scale of petabytes), and the analytics system runs on top of cloud infrastructure (e.g., thousands of interconnected commodity servers). We have been building such a system, called OceanRT, for managing large spatio-temporal data such as call logs and mobile web browsing records collected by a telecommunication company. Although there already exist systems for querying big data in real time, OceanRT's performance stands out due to several novel designs and components that address key efficiency and scalability issues that were largely overlooked in existing systems. First, OceanRT makes extensive use of software RDMA one-sided operations, which reduce networking costs without requiring specialized hardware. Second, OceanRT exploits the parallel computing capabilities of each node in the cloud through a novel architecture consisting of Access-Query Engines (AQEs) connected with minimal overhead. Third, OceanRT contains a novel storage scheme that optimizes for queries with joins and multi-dimensional selections, which are common for large spatiotemporal data. Experiments using the TPC-DS benchmark show that OceanRT is usually more than an order of magnitude faster than the current state-of-the-art systems.
