2021
DOI: 10.1109/access.2020.3040719
Scheduling Spark Tasks With Data Skew and Deadline Constraints

Abstract: Data skew has an essential impact on the performance of big data processing. Spark task scheduling with data skew and deadline constraints is considered to minimize the total rental cost in this paper. A modified scheduling architecture is developed in terms of the unique characteristics of the considered problem. A mathematical model is constructed, and a Spark task scheduling algorithm is proposed considering both the data skew and deadline constraints. The algorithm consists of three components: stage seque…

Cited by 5 publications (1 citation statement)
References 12 publications
“…For improving performance, data locality is a key factor considered by the task scheduling of Spark stages [10]. The task scheduling determines the executor on which node the task runs and the data locality refers to scheduling task close to data, so that the communication overload can be reduced [19], [15]. In particular, in the map stage, the taskScheduler uses the delay scheduling algorithm [34] that tries to assign the map task to the node which stores the data block, and in the reduce stage, the taskScheduler assigns the reduce task to one of the nodes that holds more intermediate data to the task, thus to minimize the data transfer volume.…”
Section: Introduction
confidence: 99%
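The cited statement describes two locality heuristics: map tasks are delayed until they can run on the node storing their input block, and reduce tasks are placed on whichever node already holds the most of their intermediate (shuffle) data. The reduce-side heuristic can be sketched as a simple greedy placement. This is a hypothetical illustration, not Spark's actual TaskScheduler API; the function name `place_reduce_tasks` and the `intermediate_data` mapping are assumptions made for the example.

```python
def place_reduce_tasks(intermediate_data):
    """Assign each reduce task to the node holding the most of its
    intermediate (shuffle) data, so the least data crosses the network.

    intermediate_data: dict mapping task id -> {node: bytes of that
    task's shuffle input already resident on the node}. Hypothetical
    input format for this sketch.
    """
    placement = {}
    for task, per_node in intermediate_data.items():
        # Greedy locality choice: the node with the largest local share
        # of the task's input has to fetch the least remotely.
        placement[task] = max(per_node, key=per_node.get)
    return placement

if __name__ == "__main__":
    shuffle_layout = {
        "reduce0": {"nodeA": 300, "nodeB": 120},  # most data on nodeA
        "reduce1": {"nodeA": 50, "nodeB": 400},   # most data on nodeB
    }
    print(place_reduce_tasks(shuffle_layout))
    # {'reduce0': 'nodeA', 'reduce1': 'nodeB'}
```

A real scheduler would additionally weigh executor load and free slots against locality; Spark's delay scheduling does this for map tasks by waiting a bounded time for a local slot before relaxing the locality level.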