Exploiting Data Skew for Improved Query Performance

Zhang, Wangda; Ross, Kenneth A.

doi:10.1109/tkde.2020.3006446

Cited by 8 publications

(1 citation statement)

References 66 publications


“…Their FastJoin system improved the performance in terms of latency and throughput. Zhang and Ross [12] presented an index structure to reorder data so that popular items were concentrated in the cache hierarchy. They analyzed the cache behavior and efficiently processed database queries in the presence of skew.…”

Section: Introduction (mentioning)

confidence: 99%

Comparative Analysis of Skew-Join Strategies for Large-Scale Datasets with MapReduce and Spark

Phan, Cao, et al. 2022

Applied Sciences


In the era of data deluge, Big Data gradually offers numerous opportunities, but also poses significant challenges to conventional data processing and analysis methods. MapReduce has become a prominent parallel and distributed programming model for efficiently handling such massive datasets. One of the most elementary and extensive operations in MapReduce is the join operation. These joins have become ever more complex and expensive in the context of skewed data, in which some common join keys appear with a greater frequency than others. Some of the reduction tasks processing these join keys will finish later than others; thus, the benefits of parallel computation become meaningless. Some studies on the problem of skew joins have been conducted, but an adequate and systematic comparison in the Spark environment has not been presented. They have only provided experimental tests, so there is still a shortage of representations of mathematical models on which skew-join algorithms can be compared. This study is, therefore, designed to provide the theoretical and practical basics for evaluating skew-join strategies for large-scale datasets with MapReduce and Spark—both analytically with cost models and practically with experiments. The objectives of the study are, first, to present the implementation of prominent skew-join algorithms in Spark, second, to evaluate the algorithms by using cost models and experiments, and third, to show the advantages and disadvantages of each one and to recommend strategies for the better use of skew joins in Spark.
