2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA) 2014
DOI: 10.1109/isca.2014.6853208

Enabling preemptive multiprogramming on GPUs

Abstract: GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to meet key multiprogrammed workload requirements, such as responsiveness, fairness, or quality of service. In this paper, we propose a set of hardware exten…


Cited by 117 publications (63 citation statements). References 22 publications. Citing publications span 2015 to 2021.
“…Moreover, each SM has its own on-chip scratch-pad memory, which is shared by the threads within a thread block. For modern GPUs, the context of a single SM can be as large as 256kB of register file and 48kB of shared memory [1,24,29]. With such a large context, preempting with context switching has high overhead in both preemption latency and wasted throughput.…”
Section: Prior Preemption Techniques (mentioning)
confidence: 99%
“…However, supporting preemptive multitasking on GPUs through context switching can incur a higher overhead compared to CPUs, where the context of an SM can be as large as 256kB of register file and 48kB of on-chip scratch-pad memory [1,24,29]. Not only does a kernel have to endure a long preemption latency, the GPU also wastes execution resources while context switching.…”
Section: Introduction (mentioning)
confidence: 99%
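The arithmetic behind these statements is worth making explicit. Below is a minimal back-of-envelope sketch (plain C host code): the 256 KB register file and 48 KB scratch-pad sizes come from the quoted statements, while the SM count and memory bandwidth are hypothetical values chosen as examples for a GPU of that era, not figures from the paper.

#include <stdio.h>

int main(void) {
    /* Per-SM context sizes from the citation statements above. */
    const double regfile_bytes = 256.0 * 1024.0;  /* 256 KB register file  */
    const double shmem_bytes   = 48.0 * 1024.0;   /* 48 KB scratch-pad     */

    /* Hypothetical machine parameters (assumptions, not from the paper). */
    const int    num_sms = 16;                    /* example SM count      */
    const double dram_bw = 150.0e9;               /* 150 GB/s bandwidth    */

    const double ctx_per_sm = regfile_bytes + shmem_bytes;   /* ~304 KB  */
    const double total_ctx  = ctx_per_sm * (double)num_sms;  /* ~4.75 MB */

    /* A context switch must save the preempted context and restore the
       incoming one, so the data crosses the memory bus twice. */
    const double latency_s = 2.0 * total_ctx / dram_bw;

    printf("context per SM:        %.0f KB\n", ctx_per_sm / 1024.0);
    printf("whole-GPU context:     %.2f MB\n", total_ctx / (1024.0 * 1024.0));
    printf("save+restore latency: ~%.0f us\n", latency_s * 1e6);
    return 0;
}

Under these assumed parameters the estimate comes out to tens of microseconds per full context switch, during which the preempted SMs do no useful work; this is the "preemption latency and wasted throughput" cost the statements refer to.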
“…As many applications do not require full GPU resources, spatial multitasking can improve total system throughput with concurrent execution of multiple applications compared to temporal multitasking [12,14].…”
Section: Introduction (mentioning)
confidence: 99%
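To illustrate the spatial-sharing idea within a single process, the sketch below launches two independent kernels into separate CUDA streams so the hardware scheduler may place them on different SMs concurrently. The kernels, sizes, and stream setup are hypothetical illustrations; true spatial multitasking across independent applications requires additional support, such as the hardware extensions this paper proposes.

#include <cstdio>
#include <cuda_runtime.h>

/* Two small, independent kernels standing in for two applications. */
__global__ void scaleA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void addB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 16;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    /* Separate streams remove the false ordering between the launches,
       allowing concurrent execution when neither kernel needs the
       whole GPU on its own. */
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    scaleA<<<blocks, threads, 0, s1>>>(x, n);
    addB<<<blocks, threads, 0, s2>>>(y, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    printf("done\n");
    return 0;
}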