An Optimal Skew-insensitive Join and Multi-join Algorithm for Distributed Architectures

Proceedings of the Third International Conference on Web Information Systems and Technologies

2007

Abstract: SQL queries involving join and group-by operations are fairly common in many decision support applications where the size of the input relations is usually very large, so the parallelization of these queries is highly recommended in order to obtain a desirable response time. Several parallel algorithms that treat this kind of query have been presented in the literature. However, their most significant drawbacks are that they are very sensitive to data skew and involve expensive communication and Input/Output costs in the evaluation of the join operation. In this paper, we present an algorithm that overcomes these drawbacks because it evaluates the "GroupBy-Join" query without the need for direct evaluation of the costly join operation, thus reducing its Input/Output and communication costs. Furthermore, the performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model, which predicts a linear speedup even for highly skewed data.

Section: Phase 3: Creating the Communication Templates (mentioning)

confidence: 99%

“…At the end of steps 4.a and 4.b, each processor i, has local knowledge of how the tuples of semi-joins (Bamha, 2005), we can deduce that the tuples of…”

Section: B Redistribution Of Tuples With Values (mentioning)

confidence: 99%


An Optimal Evaluation of Groupby-Join Queries in Distributed Architectures

Proceedings of the Third International Conference on Web Information Systems and Technologies

2007

“…However, these algorithms cannot solve load imbalance problem as they base their routing decisions on incomplete or statistical information. On the contrary, the algorithms we presented in (Bamha and Hains, 1999; Bamha and Hains, 2000; Bamha, 2005) for treating queries involving one join operation use a total data-distribution information in the form of histograms. The parallel cost model we apply allows us to guarantee that histogram management has a negligible cost when compared to the efficiency gains it provides to reduce the communication cost and to avoid load imbalance between processors.…”

Section: Introduction (mentioning)

confidence: 99%

Pipelined Parallelism in Multi-Join Queries on Heterogeneous Shared Nothing Architectures

Proceedings of the Third International Conference on Software and Data Technologies Special Session on Applications in Banking

2008

Abstract: Pipelined parallelism has been widely studied and successfully implemented on shared nothing machines in several join algorithms, under ideal conditions of load balancing between processors and in the absence of data skew. The aim of pipelining is to allow flexible resource allocation while avoiding unnecessary disk input/output for intermediate join results in the treatment of multi-join queries. The main drawback of pipelining in existing algorithms is that communication and load balancing remain limited to static approaches (generated during the query optimization phase) based on hashing to redistribute data over the network, and therefore cannot solve the data skew problem and load imbalance between processors on heterogeneous multi-processor architectures, where the load of each processor may vary in a dynamic and unpredictable way. In this paper, we present a new parallel join algorithm that solves the problem of data skew while guaranteeing perfect balancing properties on heterogeneous multi-processor Shared Nothing architectures. The performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model.

“…The main difficulty in such applications is that the result of these analytical queries must be obtained interactively (Datta et al, 1998; Tsois and Sellis, 2003) despite the huge volume of data in warehouses and their rapid growth especially in OLAP systems (Datta et al, 1998). For this reason, parallel processing of these queries is highly recommended in order to obtain acceptable response time (Bamha, 2005). Research has shown that join, which is one of the most expensive operations in DBMS, is parallelizable with near-linear speed-up only in ideal cases (Bamha and Hains, 2000).…”

Section: Introduction (mentioning)

confidence: 99%

Parallel Processing of “Group-By Join” Queries on Shared Nothing Machines

Communications in Computer and Information Science

Abstract: SQL queries involving join and group-by operations are frequently used in many decision support applications. In these applications, the size of the input relations is usually very large, so the parallelization of these queries is highly recommended in order to obtain a desirable response time. The main drawbacks of the parallel algorithms presented so far for this kind of query are that they are very sensitive to data skew and involve expensive communication and Input/Output costs in the evaluation of the join operation. In this paper, we present an algorithm that minimizes the communication cost by performing the group-by operation before redistribution, so that only tuples that will be present in the join result are redistributed. In addition, it evaluates the query without the need to materialize the result of the join operation, thus reducing the Input/Output cost of join intermediate results. The performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model, which predicts a near-linear speed-up even for highly skewed data.
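
The idea of grouping before redistribution can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' algorithm: the relations R(x, y) and S(x, z), the query SUM(R.y) grouped by the join attribute x, and the names `local_aggregate` and `groupby_join` are all hypothetical, and plain hashing stands in for the histogram-based balancing that the cited papers describe.

```python
from collections import Counter, defaultdict


def local_aggregate(r_frag, s_frag):
    """Partial aggregates computed by one processor before any communication:
    SUM(y) per x for its fragment of R, COUNT(*) per x for its fragment of S."""
    r_sum = defaultdict(int)
    for x, y in r_frag:
        r_sum[x] += y
    s_cnt = Counter(x for x, _ in s_frag)
    return dict(r_sum), dict(s_cnt)


def groupby_join(r_frags, s_frags):
    """Evaluate SELECT x, SUM(R.y) FROM R JOIN S USING (x) GROUP BY x
    without materializing the join (illustrative simulation only)."""
    p = len(r_frags)
    partials = [local_aggregate(r, s) for r, s in zip(r_frags, s_frags)]

    # Keys present in both relations; a stand-in for exchanging key information.
    r_keys = set().union(*(pr.keys() for pr, _ in partials))
    s_keys = set().union(*(ps.keys() for _, ps in partials))
    joinable = r_keys & s_keys

    # "Redistribute" only partial aggregates whose key survives the join.
    inbox_r = [defaultdict(int) for _ in range(p)]
    inbox_s = [defaultdict(int) for _ in range(p)]
    for pr, ps in partials:
        for x, total in pr.items():
            if x in joinable:
                inbox_r[hash(x) % p][x] += total
        for x, count in ps.items():
            if x in joinable:
                inbox_s[hash(x) % p][x] += count

    # Each R tuple with key x matches COUNT_S(x) tuples of S, so the sum over
    # the join equals SUM_R(x) * COUNT_S(x).
    result = {}
    for dest in range(p):
        for x, total in inbox_r[dest].items():
            result[x] = total * inbox_s[dest][x]
    return result


r_frags = [[(1, 10), (2, 5)], [(1, 1), (3, 7)]]
s_frags = [[(1, "a"), (1, "b")], [(2, "c"), (4, "d")]]
print(groupby_join(r_frags, s_frags))
```

In this sketch each processor sends at most one partial aggregate per key instead of every matching tuple, and keys 3 and 4, which appear in only one relation, are never sent at all.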