2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA) 2014
DOI: 10.1109/hpca.2014.6835936

A scalable multi-path microarchitecture for efficient GPU control flow

Abstract: Graphics processing units (GPUs) are increasingly used for non-graphics computing. However, applications with divergent control flow incur performance degradation on current GPUs. These GPUs implement the SIMT execution model by serializing the execution of different control flow paths encountered by a warp. This serialization can mask thread level parallelism among the scalar threads comprising a warp thus degrading performance. In this paper, we propose a novel branch divergence handling mechanism that enabl…
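
The serialization cost described in the abstract can be illustrated with a small toy model. The Python sketch below is an assumption-laden illustration, not the paper's mechanism: the function name `serialized_issue_slots`, the 32-thread warp, and the per-path instruction counts are all hypothetical. It counts how many issue slots a warp needs for one if/else when the two paths run one after the other, and what fraction of SIMD lanes do useful work.

```python
def serialized_issue_slots(taken_mask, then_len, else_len):
    """Toy model of one if/else executed by a warp with path serialization.

    taken_mask: list of bools, one per thread; True means the thread takes
                the 'then' path, False means it takes the 'else' path.
    then_len / else_len: hypothetical instruction counts of each path.
    Returns (issue_slots, simd_efficiency).
    """
    warp_size = len(taken_mask)
    n_then = sum(taken_mask)
    n_else = warp_size - n_then

    slots = 0    # issue slots consumed by the warp
    useful = 0   # lane-instructions that do real work

    # Each path is issued only if at least one thread needs it,
    # and the paths run one after the other (serialized).
    if n_then:
        slots += then_len
        useful += then_len * n_then
    if n_else:
        slots += else_len
        useful += else_len * n_else

    efficiency = useful / (slots * warp_size) if slots else 1.0
    return slots, efficiency


# Uniform warp: every thread takes the 'then' path.
uniform = serialized_issue_slots([True] * 32, then_len=4, else_len=6)

# Divergent warp: 8 threads take 'then', 24 take 'else'.
divergent = serialized_issue_slots([True] * 8 + [False] * 24, then_len=4, else_len=6)
```

In this model the divergent warp pays for both paths back to back while part of the lanes sit idle on each one, which is the lost parallelism that multi-path schemes aim to recover by letting the scheduler interleave the paths.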

Cited by 30 publications (11 citation statements)
References 20 publications

“…For the per-slice scoreboards, we use a larger FlipFlop cell (3.6µm² scaled to 40nm) from the NanGate library and 3× area overhead factor to account for the comparators necessary for an associative lookup. Compared to the scoreboard described in [9], ours has fewer bits and noticeably less area. Finally, to estimate the area cost of the additional control logic required for slicing the SIMD datapath, we examine published literature on the percentage of total core area other processors devote to control [34,4,21,20].…”
Section: Gang Splitting Policies (mentioning)
confidence: 92%
“…and (3) techniques that allow the interleaved execution of multiple branch paths within a warp by scheduling multiple paths from the control flow stack [31,9]. The fundamental characteristic that sets I-VWS and the use of small warps apart from prior work is the ability to concurrently issue many more unique PCs by scaling and distributing instruction fetch bandwidth.…”
Section: Qualitative Comparison (mentioning)
confidence: 99%
“…Rhu and Erez [29] examine a dual-path execution model provided by two PC reconvergence stacks and two register scoreboards in order to expose the warp scheduler to more parallelism when facing divergent execution paths. To extend this solution, [30] replaces the reconvergence stack with two warp split and warp reconvergence tables. Rogers et.…”
Section: Related Work On Divergence (mentioning)
confidence: 99%
“…Prior work such as [11,12,5,29,30,26,31,9,24] proposes various techniques to improve Single Instruction Multiple Data (SIMD) efficiency or increase thread level parallelism for divergent applications on GPUs. However, the use of small warps is the only way to improve both SIMD efficiency and thread level parallelism in divergent code.…”
Section: Introduction (mentioning)
confidence: 99%