Dynamic query scheduling in parallel data warehouses

Märtens, Holger; Rahm, Erhard; Stöhr, Thomas

doi:10.1002/cpe.786

Cited by 9 publications

(3 citation statements)

References 14 publications


“…As the multi‐core processors have seen a very fast evolution in the last decade and the scalability of distributed systems has also improved significantly, one can assume that the most efficient way to lower the query processing time is to parallelize their execution. Numerous studies have been performed regarding the parallel execution of queries in a database or data warehouse. Hadoop‐DB is a hybrid data warehouse environment that uses several relational database management system (PostgreSql) as data nodes and Hadoop + Hive as the execution engine.…”

Section: Related Work and Background (mentioning)

confidence: 99%

Single-scan: a fast star-join query processing algorithm

PURDILĂ

Pentiuc

2015

Softw. Pract. Exper.


Summary: A data warehouse can store very large amounts of data that should be processed in parallel in order to achieve reasonable query execution times. The MapReduce programming model is a very convenient way to process large amounts of data in parallel on commodity hardware clusters. A very popular query used in data warehouses is star‐join. In this paper, we present a fast and efficient star‐join query execution algorithm built on top of a MapReduce framework called Hadoop. By using dynamic filters against dimension tables, the algorithm needs a single scan of the fact table, which means a significant reduction of input/output operations and computational complexity. Also, the algorithm requires only two MapReduce iterations in total: one to build the filters against dimension tables and one to scan the fact table. Our experiments show that the proposed algorithm performs much better than the existing solutions in terms of execution time and input/output. Copyright © 2014 John Wiley & Sons, Ltd.
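The two-pass idea in this summary (build filters from the dimension tables, then scan the fact table once) can be illustrated with a minimal in-memory sketch. This is not the authors' algorithm and it does not use Hadoop or MapReduce: the summary does not say how the dynamic filters are represented, so plain Python dictionaries keyed by dimension key are assumed here, and every table, column, and function name below is hypothetical.

```python
from collections import defaultdict


def build_dimension_filters(dimensions, predicates):
    """Pass 1 (assumed form): keep, per dimension, the rows that pass its predicate."""
    filters = {}
    for name, rows in dimensions.items():
        keep = predicates.get(name, lambda row: True)  # no predicate: keep all rows
        filters[name] = {row["key"]: row for row in rows if keep(row)}
    return filters


def scan_fact_table(fact_rows, filters, foreign_keys, group_by, measure):
    """Pass 2: one scan of the fact table, probing every dimension filter per row."""
    totals = defaultdict(float)
    group_dim, group_attr = group_by
    for fact in fact_rows:
        matched = {}
        for dim, fk_column in foreign_keys.items():
            dim_row = filters[dim].get(fact[fk_column])
            if dim_row is None:
                break  # fact row is filtered out by this dimension
            matched[dim] = dim_row
        else:
            # Reached only when the row matched every dimension filter.
            totals[matched[group_dim][group_attr]] += fact[measure]
    return dict(totals)


# Hypothetical toy schema: two dimension tables and one fact table.
dimensions = {
    "date": [{"key": 1, "year": 2013}, {"key": 2, "year": 2014}],
    "store": [{"key": 10, "region": "EU"}, {"key": 11, "region": "US"}],
}
predicates = {"date": lambda row: row["year"] == 2014}
fact_rows = [
    {"date_id": 1, "store_id": 10, "sales": 5.0},
    {"date_id": 2, "store_id": 10, "sales": 7.0},
    {"date_id": 2, "store_id": 11, "sales": 3.0},
    {"date_id": 2, "store_id": 10, "sales": 1.0},
]

filters = build_dimension_filters(dimensions, predicates)
result = scan_fact_table(
    fact_rows,
    filters,
    foreign_keys={"date": "date_id", "store": "store_id"},
    group_by=("store", "region"),
    measure="sales",
)
print(result)
```

In this sketch each fact row is read exactly once, and the join with all dimensions happens through lookups in the prebuilt filters rather than through one join pass per dimension. In a distributed setting the filters would need to be shipped to every worker that scans a fact-table partition; how the paper does that is not stated in the summary.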


“…There is certainly place to design new data allocation functions [17], grid-base algorithms [8], distributed optimization techniques and associated workload scheduling policies [24]. Also several companies, e.g., Greenplum, Asterdata, Infobright, exploit the cluster and compute cloud infrastructures to increase the performance for business intelligence applications using modestly changed commodity open-source database systems.…”

Section: Introduction (mentioning)

confidence: 99%

The data cyclotron query processing scheme

Gonçalves

Kersten

2011

ACM Trans. Database Syst.


Distributed database systems exploit static workload characteristics to steer data fragmentation and data allocation schemes. However, the grand challenge of distributed query processing is to come up with a self-organizing architecture, which exploits all resources to manage the hot data set, minimize query response time, and maximize throughput without global co-ordination. In this paper, we introduce the Data Cyclotron architecture which addresses the challenges using turbulent data movement through a storage ring built from distributed main memory capitalizing modern remote-DMA facilities. Queries assigned to individual nodes interact with the Data Cyclotron by picking up data fragments continuously flowing around, i.e., the hot set. Each data fragment carries a level of interest (LOI) metric, which represents the cumulative query interest as the fragment passes around the ring multiple times. A fragment with a LOI below a given threshold, inversely proportional to the ring load, is pulled out to free up resources. This threshold is dynamically adjusted in a distributed manner based on ring characteristics and query needs. It optimizes the resource utilization keeping the average data access delay low. The proposed architecture has a modest impact on existing query execution engines. This is illustrated using an extensive validated simulation study for the Data Cyclotron protocols. The results underpin their robustness in turbulent workload scenarios as well as in the TPC-H scenario. Furthermore, we think that using state-of-the-art network technology, e.g., RDMA, could lead to even more promising results. The Data Cyclotron architecture opens a new vista for modern distributed database architectures with a plethora of research challenges barely scratched upon.


“…We conclude in Section 6. Details omitted due to space constraints can be found in an extended version of this paper [11].…”

Section: Introduction (mentioning)

confidence: 99%