A scalable multi-path microarchitecture for efficient GPU control flow

ElTantawy, Ahmed; Wenjie, Jessica; O'Connor, Mike; Aamodt, Tor M.

doi:10.1109/hpca.2014.6835936

“…For the per-slice scoreboards, we use a larger FlipFlop cell (3.6 µm² scaled to 40nm) from the NanGate library and 3× area overhead factor to account for the comparators necessary for an associative lookup. Compared to the scoreboard described in [9], ours has fewer bits and noticeably less area. Finally, to estimate the area cost of the additional control logic required for slicing the SIMD datapath, we examine published literature on the percentage of total core area other processors devote to control [34,4,21,20].…”

Section: Gang Splitting Policies (mentioning)

confidence: 92%

“…and (3) techniques that allow the interleaved execution of multiple branch paths within a warp by scheduling multiple paths from the control flow stack [31,9]. The fundamental characteristic that sets I-VWS and the use of small warps apart from prior work is the ability to concurrently issue many more unique PCs by scaling and distributing instruction fetch bandwidth.…”

Section: Qualitative Comparison (mentioning)

confidence: 99%

A variable warp size architecture

Rogers, Johnson, O'Connor et al. 2015

Proceedings of the 42nd Annual International Symposium on Computer Architecture

Self Cite

This paper studies the effect of warp sizing and scheduling on performance and efficiency in GPUs. We propose Variable Warp Sizing (VWS) which improves the performance of divergent applications by using a small base warp size in the presence of control flow and memory divergence. When appropriate, our proposed technique groups sets of these smaller warps together by ganging their execution in the warp scheduler, improving performance and energy efficiency for regular applications. Warp ganging is necessary to prevent performance degradation on regular workloads due to memory convergence slip, which results from the inability of smaller warps to exploit the same intra-warp memory locality as larger warps. This paper explores the effect of warp sizing on control flow divergence, memory divergence, and locality. For an estimated 5% area cost, our ganged scheduling microarchitecture results in a simulated 35% performance improvement on divergent workloads by allowing smaller groups of threads to proceed independently, and eliminates the performance degradation due to memory convergence slip that is observed when convergent applications are executed with smaller warp sizes.
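The abstract's claim that a smaller base warp size helps divergent code can be illustrated with a minimal sketch. It is not the paper's simulator: the function name `simd_efficiency` and the divergence pattern are hypothetical, and the model only counts how many issued SIMD lanes do useful work on one branch path.

```python
def simd_efficiency(active_mask, warp_size):
    """Fraction of issued lanes doing useful work on one branch path.

    active_mask: list of bools, one per thread, True if the thread takes
    the path. A warp must issue the path if any of its threads is active.
    """
    if len(active_mask) % warp_size != 0:
        raise ValueError("thread count must be a multiple of warp_size")
    issued_lanes = 0
    useful_lanes = 0
    for start in range(0, len(active_mask), warp_size):
        active = sum(active_mask[start:start + warp_size])
        if active:  # warps with no active thread skip the path entirely
            issued_lanes += warp_size
            useful_lanes += active
    return useful_lanes / issued_lanes if issued_lanes else 1.0


# Hypothetical pattern: only the first 4 of 32 threads take the branch.
mask = [i < 4 for i in range(32)]
for size in (32, 8, 4):
    print(size, simd_efficiency(mask, size))
# prints:
# 32 0.125
# 8 0.5
# 4 1.0
```

The sketch ignores memory locality, which is the cost the abstract attributes to small warps on regular workloads and the reason it proposes ganging them.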

“…Rhu and Erez [29] examine a dual-path execution model provided by two PC reconvergence stacks and two register scoreboards in order to expose the warp scheduler to more parallelism when facing divergent execution paths. To extend this solution, [30] replaces the reconvergence stack with two warp split and warp reconvergence tables. Rogers et.…”

Section: Related Work On Divergence (mentioning)

confidence: 99%

Efficient warp execution in presence of divergence with collaborative context collection

Khorasani, Gupta, Bhuyan 2015

Proceedings of the 48th International Symposium on Microarchitecture

GPU's SIMD architecture is a double-edged sword confronting parallel tasks with control flow divergence. On the one hand, it provides a high performance yet power-efficient platform to accelerate applications via massive parallelism; however, on the other hand, irregularities induce inefficiencies due to the warp's lockstep traversal of all diverging execution paths. In this work, we present a software (compiler) technique named Collaborative Context Collection (CCC) that increases the warp execution efficiency when faced with thread divergence incurred either by different intra-warp task assignment or by intra-warp load imbalance. CCC collects the relevant registers of divergent threads in a warp-specific stack allocated in the fast shared memory, and restores them only when the perfect utilization of warp lanes becomes feasible. We propose code transformations to enable applicability of CCC to a variety of program segments with thread divergence. We also introduce optimizations to reduce the cost of CCC and to avoid device occupancy limitation or memory divergence. We have developed a framework that automates application of CCC to CUDA-generated intermediate PTX code. We evaluated CCC on real-world applications and multiple scenarios using synthetic programs. CCC improves the warp execution efficiency of real-world benchmarks by up to 56% and achieves an average speedup of 1.69x (maximum 3.08x).
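The collect-then-restore idea in this abstract can be sketched as a simple counting model. This is an illustration built only from the abstract's description, not the authors' compiler transformation: the function `divergent_path_issues`, its input format, and the final flush of leftover contexts are all assumptions.

```python
def divergent_path_issues(takers_per_iter, warp_size=32):
    """Return (baseline_issues, ccc_issues) for one warp's divergent path.

    takers_per_iter: for each loop iteration, the number of lanes whose
    task needs the divergent path (0..warp_size).
    """
    # Baseline: the warp issues the path whenever any lane needs it.
    baseline = sum(1 for k in takers_per_iter if k > 0)

    stacked = 0  # contexts parked in the modelled shared-memory stack
    ccc = 0
    for k in takers_per_iter:
        if not 0 <= k <= warp_size:
            raise ValueError("taker count out of range")
        if k + stacked >= warp_size:
            # Enough work to fill every lane: idle lanes pop contexts.
            stacked -= warp_size - k
            ccc += 1
        else:
            # Not enough work yet: takers push their contexts and move on.
            stacked += k
    # Assumed cleanup: leftover contexts run once, partially filled.
    ccc += -(-stacked // warp_size)
    return baseline, ccc


print(divergent_path_issues([8, 8, 8, 8]))  # prints (4, 1)
```

In this model four quarter-full issues of the divergent path collapse into one full issue; the cost of saving and restoring registers, which the abstract says the authors optimize, is not modelled.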

“…Prior work such as [11,12,5,29,30,26,31,9,24] proposes various techniques to improve Single Instruction Multiple Data (SIMD) efficiency or increase thread level parallelism for divergent applications on GPUs. However, the use of small warps is the only way to improve both SIMD efficiency and thread level parallelism in divergent code.…”

Section: Introduction (mentioning)

confidence: 99%

A variable warp size architecture

Rogers, Johnson, O’Connor et al. 2015

SIGARCH Comput. Archit. News

Self Cite

A scalable multi-path microarchitecture for efficient GPU control flow

Cited by 30 publications

References 20 publications
