A Comparative Study of Join Algorithms in Spark

Phan, Anh-Cang; Phan, Thuong-Cang; Trieu, Thanh-Ngoan

doi:10.1007/978-3-030-63924-2_11

Cited by 4 publications

(5 citation statements)

References 15 publications

“…We evaluate the algorithms based on general cost models and experiments in Spark. This research extends our previous work [22]. The new contributions include a more complete and systematic presentation on the two-way join algorithms; and a comparative study on complexly recursive join algorithms using theory and empirical models in Spark.…”

Section: Introduction (mentioning)

confidence: 56%

A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark

Phan

Trieu

et al. 2021

SN COMPUT. SCI.

Self Cite

The estimated amount of data created daily has reached the threshold of petabytes or even zettabytes globally. It is no wonder that traditional data processing technologies cannot process and manage such extremely large volumes of data. However, these massive and varied data can be used to address business problems that could not be tackled before. To discover their value, it is necessary to perform query operations effectively in a parallel and distributed manner. One of the most common query operations is the join, which is expensive. This research systematically presents a theoretical and experimental comparison of the prominent join algorithms in the Spark environment. First, the study details the important strategies for two-way joins and recursive joins. Then, it sets out the advantages and disadvantages of each approach. In particular, the work provides mathematical cost models to make a more convincing comparison of the joins before verifying it experimentally. The results show that the comparison based on the cost models is consistent with the one based on the experiments. In general, the two-way and recursive joins that use filters are the best choices in the Spark environment.

“…Multiway joins with different entropy theories should be examined in the future. Besides, multiway join algorithms that considered data skewness in different distributed computing architectures such as Apache Spark [43] can be further studied on the basis of our research. Nonetheless, this study provides a novel method using MapReduce to achieve logically flexible partitions for join algorithms on Hadoop.…”

Section: Discussion (mentioning)

confidence: 99%

MapReduce-Based Dynamic Partition Join with Shannon Entropy for Data Skewness

Chen

Zhang

2021

Scientific Programming

Join operations of data sets play a crucial role in obtaining the relations of massive data in real life. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters like Hadoop. This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets in joins through the network, a novel MapReduce join algorithm with dynamic partition strategies called dynamic partition join (DPJ) is proposed. Leveraging the changes of entropy in the partitions of data sets during the Map and Reduce stages revises the logical partitions by changing the original input of the reduce tasks in the MapReduce jobs. Experimental results indicate that the entropy-based measures can measure entropy changes of join operations. Moreover, the DPJ variant methods achieved lower entropy compared with the existing joins, thereby increasing the feasibility of MapReduce join operations for different scenarios on Hadoop.
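The abstract above does not give the exact measure used by DPJ, so the following is only a minimal sketch of how Shannon entropy could quantify how evenly join keys spread over partitions. The helper names `shannon_entropy` and `partition_entropy`, and the use of Python's built-in `hash` as the partitioner, are assumptions for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    # Empty buckets contribute nothing to the sum, so they are skipped.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def partition_entropy(keys, num_partitions):
    """Entropy of the record counts per hash partition (hypothetical helper).

    Assumption: records are routed with hash(key) % num_partitions,
    which stands in for whatever partitioner the job actually uses.
    """
    sizes = Counter(hash(k) % num_partitions for k in keys)
    return shannon_entropy(sizes.values())


if __name__ == "__main__":
    balanced = list(range(8))   # keys spread evenly over 4 partitions
    skewed = [0] * 8            # every record lands in one partition
    print(partition_entropy(balanced, 4))
    print(partition_entropy(skewed, 4))
```

Under these assumptions, a perfectly balanced layout over 4 partitions reaches the maximum of log2(4) = 2 bits, while a fully skewed layout has zero entropy.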

“…Experimental results demonstrated that the most effective and efficient distributed spatial join algorithm depends on the characteristics of the two input datasets; broadcast join is generally fastest when one of the datasets is modest in size (and only one is large) but cannot complete when both datasets are large. In [35], a comparative study of common join algorithms in MapReduce was provided. The join algorithms (map-side join, reduce-side join, broadcast join, bloom join and intersection bloom join) based on general cost model and experiments in Spark were evaluated.…”

Section: Spatial Analytics System (mentioning)

confidence: 99%
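The citation statement above names the bloom join among the evaluated algorithms without describing it. As a rough, single-process sketch of the general idea (not the cited paper's implementation): a Bloom filter is built over the join keys of the smaller dataset, the larger dataset is pre-filtered with it, and the survivors are hash-joined. The class `BloomFilter`, the function `bloom_join`, and the filter parameters are hypothetical; in Spark the filter would typically be shipped to the workers, which is not modelled here.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_join(small, large):
    """Join two lists of (key, value) pairs on key.

    `large` is pre-filtered with a Bloom filter built on `small`,
    so most non-matching records are dropped before the join.
    """
    bloom = BloomFilter()
    table = {}
    for key, value in small:
        bloom.add(key)
        table.setdefault(key, []).append(value)

    survivors = [(k, v) for k, v in large if bloom.might_contain(k)]
    # The hash-table lookup removes any false positives the filter let through.
    return [(k, sv, lv) for k, lv in survivors for sv in table.get(k, [])]


if __name__ == "__main__":
    small = [(1, "a"), (2, "b")]
    large = [(2, "x"), (3, "y"), (2, "z")]
    print(bloom_join(small, large))
```

The benefit in a distributed setting comes from dropping non-matching records of the large dataset before they are shuffled; the final lookup keeps the result exact even when the filter reports false positives.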

“…Table 2 shows the syntheses of the implementations directly on Apache Spark of distributed algorithms with sophisticated processing techniques for other spatial queries, not using the previous SASs. Generic framework using clustering methods [28] In-memory partitioning and indexing system (SparkNN) SJQ [33] Spatial Join with Spark (SJS), uniform grid partitioning [34] Distributed join methods: Broadcast Join and Bin Join [35] Comparative study of common join algorithms in Spark TKSJQ [36] Uniform grid partitioning and improved plane-sweeping KNNJQ [37] Locality-Sensitive Hashing (LSH) algorithm in Spark MwSJQ [38] Multiway Spatial Join algorithm in Spark (MSJS), using cascaded pairwise join technique STSQ [39] Spark-based spatio-textual skyline query alg. (Multi-PSS) KCPQ, DJQ [40] SliceNBound (SnB), parent-child and common-merged strip partitioning and, plane-sweep technique [41] Strip-based partitioning and plane-sweep technique [42] Binary Space Partitioning (BSP).…”

Section: Spatial Analytics System (mentioning)

confidence: 99%

Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

Moutafis

Mavrommatis

Vassilakopoulos

et al. 2021

IJGI

The design and implementation of new distributed spatial query algorithms is a current challenge for spatial query processing in distributed computing systems. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data without worrying about the data distribution mechanism and fault tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves the K points of Training with the smallest sum of distances to every point of Query. This spatial query has been actively studied in centralized environments, and several performance-improving techniques and pruning heuristics have also been proposed, while a distributed algorithm in Apache Hadoop was recently proposed by our team. Since Apache Hadoop in general exhibits lower performance than Spark, in this paper we present the first distributed GKNN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. This algorithm incorporates programming features and facilities that are specific to Apache Spark. Moreover, techniques that improve performance and are applicable in Apache Spark are also incorporated. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and a clear winner in comparison to processing this query in Apache Hadoop.
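The GKNN definition in the abstract above can be illustrated with a brute-force, single-machine sketch. This is not the distributed Spark algorithm the paper presents and includes none of its pruning heuristics; the function name `gknn` and the use of Euclidean distance are assumptions for illustration.

```python
import heapq
import math


def gknn(query, training, k):
    """Return the k training points with the smallest sum of
    distances to all query points (brute force).

    Assumption: points are coordinate tuples and distance is Euclidean.
    """
    def total_distance(point):
        return sum(math.dist(point, q) for q in query)

    return heapq.nsmallest(k, training, key=total_distance)


if __name__ == "__main__":
    query = [(0, 0), (2, 0)]
    training = [(1, 0), (5, 5), (0, 1), (10, 10)]
    print(gknn(query, training, k=2))
```

Every training point is scored against every query point, so the cost grows with the product of the two dataset sizes, which is the kind of workload that motivates partitioning and pruning in a distributed setting.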
