2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2014)
DOI: 10.1109/sbac-pad.2014.43
Runtime Support for Adaptive Spatial Partitioning and Inter-Kernel Communication on GPUs

Cited by 17 publications (3 citation statements) | References 13 publications
“…Adriaens et al [1] proposed the use of spatial multitasking to group SMs into different sets that can run different kernels (up to four) in order to maximize application speedup. Ukidave et al [23] studied the real-time support for adaptive spatial partitioning on GPUs and highlighted the importance of L2 in this process. Aguilera et al [2] demonstrated the unfairness of spatial multitasking and proposed a fair resource allocation strategy for both performance and fairness.…”
Section: Related Work
confidence: 99%
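The works quoted above all build on the GPU's ability to execute kernels from different applications concurrently on the same device. As context, a minimal CUDA sketch of that baseline mechanism, two independent kernels launched into separate streams, is given below; the kernel names, sizes, and data are illustrative placeholders and are not taken from any of the cited papers, which add hardware or runtime policies on top of this mechanism to decide how the SMs are divided.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two independent toy kernels standing in for two co-scheduled applications.
__global__ void scaleA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void shiftB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Separate streams let the hardware run both kernels concurrently when
    // resources allow; which SMs each kernel ends up on is decided by the
    // GPU scheduler, and it is exactly this allocation that the cited
    // partitioning and fairness schemes seek to control.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    scaleA<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    shiftB<<<(n + 255) / 256, 256, 0, s2>>>(b, n);
    cudaDeviceSynchronize();

    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Note that streams only request concurrency; they neither guarantee overlap nor control SM placement, which is the gap the spatial-partitioning proposals address.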
“…Aguilera et al [2] improve the fairness of Spatial Multitasking by balancing the individual performance and the overall performance. Ukidave et al [49] extend the OpenCL run-time environment to explore several dynamic spatial multiprogramming approaches. Compared to the architectural approaches, the software approaches require source code modification.…”
Section: Related Work: GPU Concurrent Kernel Execution
confidence: 99%
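As the passage notes, software approaches to spatial multiprogramming require modifying application source. The sketch below is a minimal illustration of why, assuming a generic grid-stride rewrite rather than the specific mechanism of the cited OpenCL runtime extension: capping the number of blocks a kernel launches leaves part of the machine free for a co-runner, but only because both the kernel and its launch code were edited.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel: correctness no longer depends on the grid size, so the
// launch can be capped at a chosen number of blocks to leave other SMs free
// for a co-running kernel. This rewrite touches both the kernel and its
// launch code, which is the source modification the passage above refers to.
__global__ void scaleStrided(float *x, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    int smCount = 1;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);

    // Cap the grid at roughly half the SM count; the rest of the machine is
    // implicitly left for another kernel.
    int blocks = smCount > 1 ? smCount / 2 : 1;
    scaleStrided<<<blocks, 256>>>(x, n);
    cudaDeviceSynchronize();

    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(x);
    return 0;
}
```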
“…Round-robin CTA scheduling may lead to imbalanced execution in two specific scenarios. First, a small kernel, due to algorithmic limitations or due to a small input data set [26], [27], [28], may occupy only a subset of the SMs and may lead to imbalance across the SMs, i.e., some SMs are assigned more CTAs than others. Second, when co-executing multiple kernels through spatial multitasking, some local crossbars may be over-utilized while others are under-utilized.…”
Section: Topology-Aware CTA Scheduling
confidence: 99%
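To make the imbalance described above concrete, the sketch below has each CTA record the SM it ran on via the PTX %smid special register and prints per-SM CTA counts for a deliberately small launch; the instrumentation is a generic diagnostic of CTA-to-SM placement, not the topology-aware scheduler proposed in the cited work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the id of the SM the calling thread is running on (PTX %smid).
__device__ unsigned int smOf() {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// One vote per CTA: count how many CTAs each SM received.
__global__ void countCtasPerSm(unsigned int *ctasPerSm) {
    if (threadIdx.x == 0)
        atomicAdd(&ctasPerSm[smOf()], 1u);
}

int main() {
    int smCount = 1;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);

    unsigned int *d_counts;
    cudaMalloc(&d_counts, smCount * sizeof(unsigned int));
    cudaMemset(d_counts, 0, smCount * sizeof(unsigned int));

    // A deliberately "small" kernel: fewer CTAs than SMs, so some SMs are
    // necessarily left idle, which is the imbalance discussed above.
    int ctas = smCount > 1 ? smCount / 2 : 1;
    countCtasPerSm<<<ctas, 128>>>(d_counts);
    cudaDeviceSynchronize();

    unsigned int h_counts[256] = {0};   // assumes smCount <= 256
    cudaMemcpy(h_counts, d_counts, smCount * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    for (int s = 0; s < smCount; ++s)
        printf("SM %2d: %u CTAs\n", s, h_counts[s]);

    cudaFree(d_counts);
    return 0;
}
```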