Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming

Xu, Qing; Jeon, Hyeran; Kim, Keun Soo; Ro, Won Woo; Annavaram, Murali

doi:10.1109/isca.2016.29

Cited by 59 publications

(66 citation statements)

References 41 publications

“…Compared to the optimized hardware-based concurrent kernel execution, whose kernel launching order brings fast execution time, the results of corunning kernel pairs show 11%, 18%, and 12% speedup on AMD R9 290X, RX 480, and Vega 64, respectively, on average. Compared to the Warped-Slicer [31], the results show 29%, 18%, and 51% speedup on AMD R9 290X, RX 480, and Vega 64, respectively, on average. Our contributions are:…”

Section: Introduction (mentioning)

confidence: 94%

“…We evaluate our SF for all 153 pairs of benchmarks in Table 1. We compare our selected scheme with the original scheme of the hardware scheduler and the scheme proposed by the Warped-Slicer [31]. Since the kernel launching order can affect the execution time, we choose the execution time of optimized order for the hardware scheduler, and the speedup is shown as ORI points in Figure 13.…”

Section: Smk Scheduling Evaluation For Kernel Pairs (mentioning)

confidence: 99%

“…Since the kernel launching order can affect the execution time, we choose the execution time of optimized order for the hardware scheduler, and the speedup is shown as ORI points in Figure 13. In Figure 13, the WAS points show the speedup of the execution time with an SMK scheme by the Warped-Slicer (WAS) [31]; the MDL points show the speedup of the execution time with an optimized SMK scheme by our performance model. Here, if the SMK scheme proposed by MDL or WAS is the same as ORI, we use the default hardware scheduler, because ORI does not involve the overhead of kernel slicing and kernel stretching.…”

Section: Smk Scheduling Evaluation For Kernel Pairs (mentioning)

confidence: 99%

“…However, their LEFTOVER scheduling policy of GPU hardware schedulers [1,21] decreases the concurrency, because the first launched kernel may use up one of the static resources of a GPU and make other kernels unable to be dispatched. Solutions proposed in recent research to improve the concurrency of GPU kernels can be classified into two categories: (1) spatially partitioned sharing (SPS), which coexecutes different kernels on disjointed sets of compute units (CUs) [1,10,13,17], and (2) simultaneous multikernel (SMK), which runs multiple kernels simultaneously within a CU [14,21,22,29,31,33]. In general, SMK can improve the resource utilization even more than SPS, because SMK can launch more threads on a CU by corunning kernels with both complementary static resource requirements and interleaving instructions from kernels with low dynamic resource contentions, while SPS only allows different kernels to corun on disjointed sets of CUs [22].However, there is a lack of software solutions for applying SMK to GPUs.…”

mentioning

confidence: 99%

A Model-Based Software Solution for Simultaneous Multiple Kernels on GPUs

Liu, Lin, et al. 2020. ACM Trans. Archit. Code Optim.

As a critical computing resource in multiuser systems such as supercomputers, data centers, and cloud services, a GPU contains multiple compute units (CUs). GPU Multitasking is an intuitive solution to underutilization in GPGPU computing. Recently proposed solutions of multitasking GPUs can be classified into two categories: (1) spatially partitioned sharing (SPS), which coexecutes different kernels on disjointed sets of compute units (CU), and (2) simultaneous multikernel (SMK), which runs multiple kernels simultaneously within a CU. Compared to SPS, SMK can improve resource utilization even further due to the interleaving of instructions from kernels with low dynamic resource contentions. However, it is hard to implement SMK on current GPU architecture, because (1) techniques for applying SMK on top of GPU hardware scheduling policy are scarce and (2) finding an efficient SMK scheme is difficult due to the complex interferences of concurrently executed kernels. In this article, we propose a lightweight and effective performance model to evaluate the complex interferences of SMK. Based on the probability of independent events, our performance model is built from a totally new angle and contains limited parameters. Then, we propose a metric, symbiotic factor, which can evaluate an SMK scheme so that kernels with complementary resource utilization can corun within a CU. Also, we analyze the advantages and disadvantages of kernel slicing and kernel stretching techniques and integrate them to apply SMK on GPUs instead of simulators. We validate our model on 18 benchmarks. Compared to the optimized hardware-based concurrent kernel execution whose kernel launching order brings fast execution time, the results of corunning kernel pairs show 11%, 18%, and 12% speedup on AMD R9 290X, RX 480, and Vega 64, respectively, on average. Compared to the Warped-Slicer, the results show 29%, 18%, and 51% speedup on AMD R9 290X, RX 480, and Vega 64, respectively, on average.

…applications [30], and cloud applications [26]. Furthermore, with increasing computing power and new architecture features, new-generation GPUs can support larger and more complex computing tasks. Observation [21,29] has shown the on-chip resource underutilization of single-kernel execution. Therefore, while GPUs become more general, the underutilization of GPUs is becoming a more critical issue in modern systems. Efficiently sharing GPUs for general-purpose computing on GPU (GPGPU) applications is of great importance. Programmers write a GPGPU program using CUDA [19] or OpenCL [8] programming models and offload computing to GPU as kernels. Corunning kernels have drawn extensive attention both in industry and academia [1, 6, 10, 13, 14, 16, 17, 20-22, 28, 29, 31, 33]. The resources used by a kernel include both static resources (threads, registers, and shared memory) and dynamic resources (computing cores, memory load/store units, bandwidth, and memory interconnection). Modern GPU architectures, like NVIDIA Kepler [20] and AMD GCN [16], support c...

“…Preemption mechanism [12] and dynamic scheduling [2] are orthogonal to our research, and therefore can be applied to our system. Recently, simultaneous multi-kernel, (SMK) which executes multiple kernels within a same SM, is proposed to improve the utilization of resources inside SMs [15,16]. Though these hardware algorithms can optimize thread block allocation for high resource utilization inside SMs, the control logic is too complex.…”

Section: Case Study: Pf+bp and Hs+sm (mentioning)

confidence: 99%

Efficient GPU multitasking with latency minimization and cache boosting

Kim, Chu, Park 2017. IEICE Electron. Express

GPU spatial multitasking has been proven to be quite effective at executing different applications concurrently using SM partitioning. However, while it maximizes total throughput, latency-critical applications often cannot meet their deadlines due to the increased execution time. Furthermore, SM partitioning cannot allocate the appropriate L1 cache size per kernel. To solve these problems, this paper proposes a new application-aware resource allocation framework called GPU Fine-Tuner, for assigning appropriate resources to GPU kernels. To minimize the execution time of latency-constrained applications, it assigns them more SMs when performance is not affected. It also increases the cache size of SMs for cache-sensitive kernels using resource borrowing from neighbors for cache-insensitive kernels. Experimental results show that the Fine-Tuner outperforms GPU spatial multitasking with up to 15% less average latency without performance degradation.

Heuristics for concurrent task scheduling on GPUs

López-Albelda, Lázaro-Muñoz, González-Linares, et al. 2019. Concurrency and Computation

Summary: Concurrent execution of tasks in GPUs can reduce the computation time of a workload by overlapping data transfer and execution commands. However, it is difficult to implement an efficient runtime scheduler that minimizes the workload makespan as many execution orderings should be evaluated. In this paper, we employ scheduling theory to build a model that takes into account the device capabilities, workload characteristics, constraints, and objective functions. In our model, GPU task scheduling is reformulated as a flow shop scheduling problem, which allows us to apply and compare well-known heuristics already developed in the operations research field. In addition, we develop a new heuristic, specifically focused on executing GPU commands, that achieves better scheduling results than previous ones. It leverages a precise GPU command execution model for both computation and data transfers to carry out more advantageous scheduling decisions. A comprehensive evaluation, showing the suitability and robustness of this new approach, is conducted in three different NVIDIA architectures (Kepler, Maxwell, and Pascal). Results confirm the proposed heuristic achieves the best results in more than 90% of the experiments. Furthermore, a comparison has been made with MPS (Multi-Process Service), the NVIDIA API that deals with the execution of concurrent tasks, which shows that our solution obtains speed-ups ranging from 1.15 to 1.20.
