2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2018.00027
Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls

Cited by 33 publications (33 citation statements); References 36 publications
“…We first consider when a kernel is in the stalled state; p(k, M std ) is the probability that the kernel is stalled. In this case, the interconnect of the GPU memory system is busy fetching or storing data, and the memory pipeline may be stalled by cache-miss-related resource saturation [6,23]. This delays the memory operations of all other co-running kernels.…”
Section: Slowdown Caused By Conflicts
confidence: 99%
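The stall probability in the quote above lends itself to a toy expected-delay calculation. This is only an illustrative sketch, not the cited paper's model: `p_stall` stands in for p(k, M std ), and the per-stall penalty is an assumed constant.

```python
# Toy sketch (assumptions mine, not the cited model): the expected extra
# latency a co-running kernel's memory operation sees when kernel k stalls
# the shared memory pipeline with probability p_stall.
def expected_memory_delay(p_stall: float, stall_penalty_cycles: float) -> float:
    """Expected extra cycles per memory operation of a co-running kernel."""
    assert 0.0 <= p_stall <= 1.0
    return p_stall * stall_penalty_cycles

# If a kernel stalls the pipeline 25% of the time at an assumed 40-cycle
# penalty, a co-runner pays 10 extra cycles per memory operation on average.
print(expected_memory_delay(0.25, 40.0))  # 10.0
```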
“…Recently, contention in the GPU memory system has drawn extensive attention. The work in [6] shows performance improvements by reducing memory pipeline stalls: it balances the memory accesses of concurrent kernels and limits the number of in-flight memory instructions issued by each kernel.…”
Section: Related Work
confidence: 99%
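The per-kernel limit on in-flight memory instructions described above can be sketched as a simple quota check at issue time. This is a minimal illustration with semantics I am assuming, not the paper's hardware mechanism.

```python
# Minimal sketch (assumed semantics, not the cited implementation) of
# per-kernel throttling of in-flight memory instructions: a kernel may
# issue a new memory instruction only while it is below its quota.
class MemoryThrottle:
    def __init__(self, quota_per_kernel: int):
        self.quota = quota_per_kernel
        self.in_flight = {}  # kernel id -> outstanding memory instructions

    def try_issue(self, kernel: str) -> bool:
        """Return True and count the instruction if the kernel has quota left."""
        count = self.in_flight.get(kernel, 0)
        if count >= self.quota:
            return False  # kernel throttled: quota exhausted
        self.in_flight[kernel] = count + 1
        return True

    def complete(self, kernel: str) -> None:
        """A memory instruction finished; free one quota slot."""
        self.in_flight[kernel] -= 1

t = MemoryThrottle(quota_per_kernel=2)
print(t.try_issue("A"), t.try_issue("A"), t.try_issue("A"))  # True True False
t.complete("A")
print(t.try_issue("A"))  # True
```

Bounding each kernel's outstanding memory instructions keeps one memory-intensive kernel from saturating cache-miss resources and stalling the pipeline for its co-runners.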
“…the assigned TLP resources per SM; hence decreasing the assigned TLP resources leads to a severe per-SM performance drop. In addition, co-executing two kernels on the same SM unavoidably leads to intra-SM contention for various resources, including the L1 cache and/or the load/store units [13]. Intra-SM contention may slow down one kernel or, in some cases, both kernels.…”
Section: Why Existing Solutions Fail
confidence: 99%
“…As mentioned in the introduction, simultaneous multikernel (SMK) execution [9,10] is another approach to improving resource utilization in a fine-grained way within an SM. Although previous work showed that SMK works well for mixes of applications with different execution characteristics [9,10,13], Hongwen et al. [31,32] more recently pointed out that even under a state-of-the-art intra-SM sharing scheme, performance still suffers due to interference among concurrent applications. Here we implement SMK following [10], which improves performance by dynamically partitioning SM resources.…”
Section: Comparison Against SMK
confidence: 99%
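The dynamic SM-resource partitioning mentioned in the quote above can be sketched as a proportional split. The policy here (shares proportional to measured IPC) and all parameter names are my assumptions for illustration, not the scheme of [10].

```python
# Illustrative sketch (assumptions mine): repartition an SM's thread slots
# between two co-resident kernels in proportion to each kernel's measured
# useful progress (e.g., instructions per cycle), so the kernel making
# better use of the SM receives the larger share.
def partition_thread_slots(total_slots: int, ipc_a: float, ipc_b: float):
    """Split SM thread slots between kernels A and B proportionally to IPC."""
    assert ipc_a + ipc_b > 0
    share_a = round(total_slots * ipc_a / (ipc_a + ipc_b))
    return share_a, total_slots - share_a

# With an assumed 2048 thread slots per SM, a kernel running at 3x the
# co-runner's IPC receives 3/4 of the slots.
print(partition_thread_slots(2048, ipc_a=1.5, ipc_b=0.5))  # (1536, 512)
```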