EffiSha

Chen, Guoyang; Zhao, Yimin; Shen, Xipeng; Zhou, Huiyang

doi:10.1145/3018743.3018748

Cited by 65 publications

(6 citation statements)

References 26 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…It schedules GPU kernels by adjusting the number and size of logical TBs spawned in one launch. EffiSha [29] dispatches logical TBs on the basis of the scheduler's decisions. These approaches use the ends of logical TBs as scheduling points.…”

Section: Related Workmentioning

confidence: 99%

“…To address these problems, we introduce a thin TB scheduler, inspired by the idea of Elastic kernels [38] and EffiSha [29]. Our TB scheduler, which is a software mechanism running inside the device, puts all the TBs in its queue and only executes the same number of TBs as concurrently runnable TBs on the SMs.…”

Section: Thread Block Controlmentioning

confidence: 99%

“…Scientific apps [5], [6] exclusively use GPUs to compute their simulations. Existing GPU resource managers, including GPU command-based schedulers [24]- [26], novel GPU kernel launchers [27], [28], and thread block schedulers [29], [30], fail to schedule GPU eaters appropriately since GPU eaters do not provide scheduling points such as kernel launches or thread block completion; thus, a hosted GPU eater may monopolize the GPU. Other techniques, such as context funneling [31], [32] and persistent threads [33], effectively schedule GPU eaters but fail to isolate GPGPU apps; thus, a hosted GPGPU app may access and modify the memory of other GPGPU apps, which is not suitable for multi-tenant cloud platforms.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Cooperative GPGPU Scheduling for Consolidating Server Workloads

Suzuki

Yamada

Kato

et al. 2018

IEICE Trans. Inf. & Syst.

View full text Add to dashboard Cite

Graphics processing units (GPUs) have become an attractive platform for general-purpose computing (GPGPU) in various domains. Making GPUs a time-multiplexing resource is a key to consolidating GPGPU applications (apps) in multi-tenant cloud platforms. However, advanced GPGPU apps pose a new challenge for consolidation. Such highly functional GPGPU apps, referred to as GPU eaters, can easily monopolize a shared GPU and starve collocated GPGPU apps. This paper presents GLoop, which is a software runtime that enables us to consolidate GPGPU apps including GPU eaters. GLoop offers an event-driven programming model, which allows GLoop-based apps to inherit the GPU eaters' high functionality while proportionally scheduling them on a shared GPU in an isolated manner. We implemented a prototype of GLoop and ported eight GPU eaters on it. The experimental results demonstrate that our prototype successfully schedules the consolidated GPGPU apps on the basis of its scheduling policy and isolates resources among them.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Thread Block Controlmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Cooperative GPGPU Scheduling for Consolidating Server Workloads

Suzuki

Yamada

Kato

et al. 2018

IEICE Trans. Inf. & Syst.

View full text Add to dashboard Cite

show abstract

“…As a result, the behavior of collectives on GPU intrinsically meets the resource-holding condition of a deadlock situation. ❸ There is no publicly accessible official preemptive scheduling support for GPUs, and the GPU-preemption techniques in literature [6,16,24,25,30,42,51,54,59] are not suitable for collectives (see Sec. 6 for details).…”

Section: Background and Motivation 21 Collectives And Deadlocks In Di...mentioning

confidence: 99%

“…Software methods [6,16,24,25,54,59] can be applied directly on commodity GPUs. Wait-based preemption approaches [6,54,59] modify user kernels to insert scheduling points so that user kernels quit more frequently and expose more scheduling opportunities. Lee et al [24,25] and REEF [16] kill the preempted kernel directly to decrease scheduling delay.…”

Section: Related Workmentioning

confidence: 99%

Moral hazard in securitization

Zhuang

2017

Peking University Law Journal

View full text Add to dashboard Cite

Various distributed deep neural network (DNN) training technologies lead to increasingly complicated use of collective communications on GPU. The deadlock-prone collectives on GPU force researchers to guarantee that collectives are enqueued in a consistent order on each GPU to prevent deadlocks. In complex distributed DNN training scenarios, manual hardcoding is the only practical way for deadlock prevention, which poses significant challenges to the development of artificial intelligence. This paper presents OCCL, which is, to the best of our knowledge, the first deadlock-free collective communication library for GPU supporting dynamic decentralized preemption and gang-scheduling for collectives. Leveraging the preemption opportunity of collectives on GPU, OCCL dynamically preempts collectives in a decentralized way via the deadlock-free collective execution framework and allows dynamic decentralized gang-scheduling via the stickiness adjustment scheme. With the help of OCCL, researchers no longer have to struggle to get all GPUs to launch collectives in a consistent order to prevent deadlocks. We implement OCCL with several optimizations and integrate OCCL with a distributed deep learning framework OneFlow. Experimental results demonstrate that OCCL achieves comparable or better latency and bandwidth for collectives compared to NCCL, the state-of-the-art. When used in distributed DNN training, OCCL can improve the peak training throughput by up to 78% compared to statically sequenced NCCL, while introducing overheads of less than 6.5% across various distributed DNN training approaches. INTRODUCTIONRecent years have witnessed that the number of the state-of-theart (SOTA) deep neural network (DNN) models' parameters grows much faster than a single GPU's memory capacity and computational power [1,14,48]. This entails distributed DNN training, which includes various techniques such as data parallelism [27, 47], tensor parallelism [4, 50, 56], and pipeline parallelism [19,32,33], as well as hybrid parallelism [4,33,45], etc. Collective communication plays a critical role in distributed DNN training.Widely used collectives on GPU are deadlock-prone [40] because preemption is ill-supported on GPUs and collectives work in a resource-holding and busy-looping way on GPUs. As a result, the only chance to prevent collective-related deadlocks in distributed DNN training is to guarantee that all collectives are invoked in a consistent order on each of the GPUs.

show abstract

Heuristics for concurrent task scheduling on GPUs

López-Albelda

Lázaro-Muñoz

González-Linares

et al. 2019

Concurrency and Computation

View full text Add to dashboard Cite

Summary Concurrent execution of tasks in GPUs can reduce the computation time of a workload by overlapping data transfer and execution commands. However, it is difficult to implement an efficient runtime scheduler that minimizes the workload makespan as many execution orderings should be evaluated. In this paper, we employ scheduling theory to build a model that takes into account the device capabilities, workload characteristics, constraints, and objective functions. In our model, GPU tasks scheduling is reformulated as a flow shop scheduling problem, which allow us to apply and compare well‐known heuristics already developed in the operations research field. In addition, we develop a new heuristic, specifically focused on executing GPU commands, that achieves better scheduling results than previous ones. It leverages on a precise GPU command execution model for both computation and data transfers to carry out more advantageous scheduling decisions. A comprehensive evaluation, showing the suitability and robustness of this new approach, is conducted in three different NVIDIA architectures (Kepler, Maxwell, and Pascal). Results confirm the proposed heuristic achieves the best results in more than 90% of the experiments. Furthermore, a comparison has been made with MPS (Multi‐Process Service), the NVIDIA API that deals with the execution of concurrent tasks, which shows that our solution obtains speed‐ups ranging from 1.15 to 1.20.

show abstract

EffiSha

Cited by 65 publications

References 26 publications

Cooperative GPGPU Scheduling for Consolidating Server Workloads

Cooperative GPGPU Scheduling for Consolidating Server Workloads

Moral hazard in securitization

Heuristics for concurrent task scheduling on GPUs

Contact Info

Product

Resources

About