cCUDA: Effective Co-Scheduling of Concurrent Kernels on GPUs
2020
DOI: 10.1109/tpds.2019.2944602



Cited by 17 publications (4 citation statements)
References 35 publications
“…For an application with ample scope for concurrency, we have observed that, rather than relying on traditional coarse-grained scheduling decisions, implementing fine-grained scheduling policies with PySchedCL, where the user specifies an intuitive task component partitioning T after examining the structure of a DAG application, results in significantly better execution times. Future work entails investigating sophisticated low-level scheduling approaches such as sub-kernel partitioning [9], [25] at the work-item level for effective interleaving of concurrent kernels. Such approaches, coupled with Machine Learning-assisted control-theoretic scheduling solutions [26], will be used to develop an auto-tuning framework on top of PySchedCL that automatically determines, for a given application-architecture pair, the optimal allocation of command queues across devices in the platform.…”
Section: Discussion
confidence: 99%
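As an aside on what such fine-grained dispatch looks like in practice: the quoted work targets OpenCL command queues through PySchedCL, but the same idea can be sketched with CUDA streams and events. A minimal sketch, in which the kernels taskA/taskB/taskC and the two-stream partitioning are hypothetical, not taken from the paper:

```cuda
#include <cuda_runtime.h>

// Hypothetical DAG: taskA and taskB are independent; taskC depends on both.
__global__ void taskA(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void taskB(float *x) { x[threadIdx.x] *= 2.0f; }
__global__ void taskC(float *x) { x[threadIdx.x] -= 3.0f; }

int main() {
    float *a, *b;
    cudaMalloc(&a, 256 * sizeof(float));
    cudaMalloc(&b, 256 * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Independent DAG components go to different streams (the CUDA analogue
    // of command queues) so the hardware may interleave them.
    taskA<<<1, 256, 0, s0>>>(a);
    taskB<<<1, 256, 0, s1>>>(b);

    // Express the B->C dependency with an event instead of a full device
    // synchronization; the A->C dependency is implied by stream order in s0.
    cudaEvent_t bDone;
    cudaEventCreate(&bDone);
    cudaEventRecord(bDone, s1);
    cudaStreamWaitEvent(s0, bDone, 0);
    taskC<<<1, 256, 0, s0>>>(a);

    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b);
    return 0;
}
```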
“…Another interesting observation is that the individual execution time of each kernel increases slightly as a result of interleaving. This is because work groups of the concurrently dispatched kernels are scheduled in a round-robin fashion to the compute units of the device, causing resource contention [9]. However, the total time to finish the kernels concurrently is less than when they are dispatched in sequence.…”
Section: Motivation
confidence: 99%
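The effect described above can be reproduced with a small timing experiment: launch the same two kernels first in one stream (sequential) and then in two streams (concurrent) and compare elapsed times. A minimal CUDA sketch; the busy-loop kernel and its sizes are hypothetical:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that keeps several blocks busy so two instances can
// share the device's compute units.
__global__ void busy(float *x, int iters) {
    float v = x[threadIdx.x];
    for (int i = 0; i < iters; ++i) v = v * 1.000001f + 0.5f;
    x[threadIdx.x] = v;
}

// Time two launches issued to the given streams (same stream = sequential,
// different streams = potentially concurrent).
static float timeLaunches(cudaStream_t s0, cudaStream_t s1, float *a, float *b) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    busy<<<8, 256, 0, s0>>>(a, 1 << 20);
    busy<<<8, 256, 0, s1>>>(b, 1 << 20);
    cudaEventRecord(stop);          // legacy default stream: waits on both
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

int main() {
    float *a, *b;
    cudaMalloc(&a, 256 * sizeof(float));
    cudaMalloc(&b, 256 * sizeof(float));
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    // Same stream: the two kernels run one after the other.
    printf("sequential: %.2f ms\n", timeLaunches(s0, s0, a, b));
    // Different streams: work groups of both kernels are interleaved across
    // the compute units, so the total time shrinks even though each kernel
    // individually runs a bit longer under contention.
    printf("concurrent: %.2f ms\n", timeLaunches(s0, s1, a, b));
    return 0;
}
```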
“…Wen et al. [20] propose a graph-based algorithm to schedule kernels in pairs. The recent work of Shekofteh et al. [6] proposes co-scheduling pairs of kernels with different execution behaviors. They use kernel slicing to improve the choice of kernel pairs for co-scheduling.…”
Section: Related Work
confidence: 99%
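Kernel slicing, as referenced here, splits one kernel's grid into several smaller launches so that slices of different kernels can be interleaved on the device. A minimal sketch of the idea, assuming a hypothetical element-wise kernel with an explicit block offset (not the actual implementation of [6]):

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise kernel with an explicit block offset, so one
// large grid can be issued as several smaller "slices".
__global__ void scale(float *x, int n, int blockOffset) {
    int i = (blockIdx.x + blockOffset) * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Issue a kernel's grid as slices of `sliceBlocks` blocks on stream `s`.
// With two such sliced kernels on two streams, the scheduler can interleave
// their slices instead of letting one kernel monopolize the device.
void launchSliced(float *x, int n, int totalBlocks, int sliceBlocks,
                  cudaStream_t s) {
    for (int off = 0; off < totalBlocks; off += sliceBlocks) {
        int blocks = (totalBlocks - off < sliceBlocks) ? (totalBlocks - off)
                                                       : sliceBlocks;
        scale<<<blocks, 256, 0, s>>>(x, n, off);
    }
}
```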
“…However, GPU resource allocation is performed by the hardware, which assigns as many resources as possible to one task and then assigns the remaining resources to the next task, if there are sufficient resources left over [5]. This allocation policy has been shown to lead to an unreasonable use of resources [2] and to be influenced by the order in which the kernels are launched [6], [7]. Therefore, kernel launches to the GPU have to be performed wisely in order to avoid an unbalanced occupation of the GPU resources and the consequent negative effect on system performance.…”
Section: Introduction
confidence: 99%
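The leftover allocation policy described above can be made concrete with the CUDA occupancy API, which reports how many blocks of a kernel fit on one streaming multiprocessor given its resource demands; whatever a first kernel claims is unavailable to the next. A minimal sketch with two hypothetical kernels of different shared-memory footprints:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two hypothetical kernels with different resource footprints: `heavy`
// reserves 32 KiB of shared memory per block, `light` reserves none.
__global__ void heavy(float *x) {
    __shared__ float tile[8192];   // 32 KiB of shared memory per block
    tile[threadIdx.x] = x[threadIdx.x];
    __syncthreads();
    x[threadIdx.x] = tile[threadIdx.x] + 1.0f;
}

__global__ void light(float *x) {
    x[threadIdx.x] += 1.0f;
}

int main() {
    int heavyBlocks, lightBlocks;
    // How many blocks of each kernel fit on one SM, given their
    // register and shared-memory demands.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&heavyBlocks, heavy, 256, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&lightBlocks, light, 256, 0);
    printf("blocks per SM: heavy=%d light=%d\n", heavyBlocks, lightBlocks);
    // If `heavy` is launched first with a grid large enough to claim these
    // slots, `light` only runs on whatever resources are left over, which
    // is why launch order affects the resulting resource partitioning.
    return 0;
}
```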