Proceedings of the 29th ACM on International Conference on Supercomputing 2015
DOI: 10.1145/2751205.2751234

A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs

Abstract: General-Purpose Graphics Processing Units (GPGPUs) have been widely used in high-performance computing as application accelerators due to their massive parallelism and high throughput. A GPGPU generally contains two layers of schedulers, a cooperative-thread-array (CTA) scheduler and a warp scheduler, which administer the thread-level parallelism (TLP). Previous research shows that maximized TLP does not always deliver optimal performance. Unfortunately, existing warp scheduling schemes do not optimize TLP at…

Cited by 16 publications (4 citation statements). References 35 publications.
“…EXPARS can improve TLP by enabling more CTAs per SM through expanding the register file into scratchpad memory. However, previous works [5,13,19,43] have shown that higher TLP does not always mean higher performance, due to resource contention. To alleviate the contention, we propose a Lazy Two-Level Warp Scheduler (LTLWS), inspired by Reference [19], to control the maximum number of schedulable warps (active warps) at runtime.…”
Section: A Lazy Two-Level Warp Scheduler
confidence: 87%
“…Jing et al. [12] introduce an integrated architecture that enables the register file to also serve as a cache, which shares the weaknesses noted above. GPU warp scheduling has been a hot research topic in recent years [19,20,31,36,43]. Lee et al. [20] first propose a profiling algorithm to identify critical warps and then schedule those critical warps more frequently than others.…”
Section: Evaluation for Advanced Architecture
confidence: 99%
“…In addition, Kayiran et al. [11] proposed a dynamic CTA scheduling technique that attempts to allocate the optimal number of CTAs per core based on application demands, demonstrating that executing the maximum number of CTAs per core is not always the best way to boost performance, due to high cache and memory contention. Yu et al. [33] presented a Stall-Aware Warp Scheduling (SAWS) policy, which dynamically optimizes TLP according to pipeline stalls. SAWS effectively improves pipeline efficiency by reducing structural hazards without introducing new data hazards.…”
Section: Related Work Using Hints in Microprocessors
confidence: 99%
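The idea of throttling TLP in response to pipeline stalls, as described in the statement above, can be illustrated with a minimal sketch. This is not the paper's actual SAWS algorithm; the class name, thresholds, and halving/increment policy are all hypothetical, chosen only to show the general shape of a stall-aware warp cap:

```python
# Hypothetical sketch of stall-aware warp throttling: each sampling epoch,
# the scheduler observes the fraction of stalled pipeline cycles and raises
# or lowers the cap on schedulable ("active") warps per SM accordingly.

class StallAwareThrottle:
    """Adjusts the active-warp cap based on observed stall rates."""

    def __init__(self, max_warps=48, min_warps=4,
                 high_stall=0.30, low_stall=0.10):
        self.max_warps = max_warps    # hardware warp limit per SM
        self.min_warps = min_warps    # never throttle below this
        self.high_stall = high_stall  # hypothetical upper threshold
        self.low_stall = low_stall    # hypothetical lower threshold
        self.active_cap = max_warps   # start with maximal TLP

    def update(self, stall_rate):
        """Called once per epoch with the fraction of stalled cycles."""
        if stall_rate > self.high_stall:
            # Heavy contention: halve the cap to relieve pressure.
            self.active_cap = max(self.min_warps, self.active_cap // 2)
        elif stall_rate < self.low_stall:
            # Pipeline mostly busy: cautiously admit one more warp.
            self.active_cap = min(self.max_warps, self.active_cap + 1)
        # In between: hold the current cap steady.
        return self.active_cap
```

The asymmetric policy (multiplicative decrease, additive increase) is one common way to back off quickly under contention while probing gently for headroom; the real schedulers cited here use their own, more refined criteria.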
“…Recent research on GPGPUs has focused on optimizing thread-level parallelism and maximizing the execution of cooperative thread arrays [1][2][3]. This has made GPGPUs more viable for high-performance computation.…”
Section: Introduction
confidence: 99%