39th Annual International Symposium on Computer Architecture (ISCA), 2012
DOI: 10.1109/isca.2012.6237005

Simultaneous branch and warp interweaving for sustained GPU performance

Citing publications span 2013–2023.
Cited by 56 publications (54 citation statements). References 25 publications (17 reference statements).
“…For instance, more flexibility could be obtained using Dynamic Warp Formation [24] or Simultaneous Branch Interweaving [25]; Dynamic Warp Subdivision [9] could improve latency tolerance by allowing threads to diverge on partial cache misses; and Dynamic Scalarization [29] could further unify redundant dataflow across threads.…”
Section: Discussion
confidence: 99%
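
As a rough illustration of the lane underutilization these regrouping schemes target, here is a minimal CUDA sketch (ours, not taken from any of the cited works; it assumes a CUDA 9+ toolchain for `__activemask()`, and the kernel and variable names are hypothetical). Inside one side of a divergent branch it counts how many of each warp's 32 lanes are still active:

```cuda
#include <cstdio>

// Hypothetical measurement kernel: after a data-dependent branch,
// __activemask() reports which lanes of the 32-wide warp reached this
// point. Counting those lanes exposes the SIMD underutilization that
// regrouping schemes such as Dynamic Warp Formation or Simultaneous
// Branch Interweaving try to recover.
__global__ void countActiveLanes(const int *in, int *activeA, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0) {                      // only even-data lanes enter
        unsigned mask = __activemask();        // lanes active on this path
        int lane = threadIdx.x & 31;
        if (lane == __ffs(mask) - 1)           // lowest active lane reports
            atomicAdd(activeA, __popc(mask));  // e.g. 16 of 32 lanes busy
    }
}

int main() {
    const int n = 256;
    int *in, *activeA;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&activeA, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = i;     // alternating even/odd data
    *activeA = 0;
    countActiveLanes<<<1, n>>>(in, activeA, n);
    cudaDeviceSynchronize();
    // 8 warps, each with only 16 of 32 lanes on the even path: 8*16 = 128.
    printf("active lane-slots on even path: %d\n", *activeA);
    cudaFree(in);
    cudaFree(activeA);
    return 0;
}
```

With this alternating input every warp executes the even path at half its SIMD width; that idle half is exactly the capacity Dynamic Warp Formation or Simultaneous Branch Interweaving would try to fill with threads from other warps or from the other branch path.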
“…They do not support exceptions or interrupts, which prevents their use with a general-purpose system software stack. Various works extend the SIMT model to support more generic code [22], [23] or more flexible execution [24], [25], [26]. However, they all target applications specifically written for GPUs, rather than general-purpose parallel applications.…”
Section: E. Power and Energy
confidence: 99%
“…The major bottleneck of this GPU deployment was control-flow divergence, which is especially penalizing given the GPU's partial SIMD (Single Instruction, Multiple Data) execution. Hardware [9] and software [10], [11] solutions have recently been proposed to address this problem on GPUs. However, these solutions are not efficient in our context, as each GPU thread has a very fine computation grain.…”
Section: Motivations and Contributions
confidence: 99%
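
To make the quoted bottleneck concrete, the sketch below (a hypothetical CUDA kernel of ours, unrelated to the cited solutions [9]-[11]) shows the divergence pattern in question: when lanes of one 32-thread warp take different sides of a data-dependent branch, the SIMD hardware runs the two paths back to back with the inactive lanes masked off, so the warp pays the latency of both.

```cuda
#include <cstdio>

// Hypothetical kernel illustrating control-flow divergence: lanes of a
// 32-thread warp that take different branch directions are serialized,
// i.e. the warp executes path A with the odd-data lanes masked off,
// then path B with the even-data lanes masked off.
__global__ void divergent(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] % 2 == 0)
        out[i] = in[i] * 2;   // path A: only even-data lanes active
    else
        out[i] = in[i] + 1;   // path B: only odd-data lanes active
}

int main() {
    const int n = 1024;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = i;   // alternating even/odd data
    divergent<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0]=%d out[1]=%d\n", out[0], out[1]);  // 0 and 2
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

With this alternating input each warp runs both paths at half width; the finer the per-thread computation grain, the larger the fraction of runtime lost to such masked-off execution.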
“…Resource underutilization due to branch divergence or thread-level divergence has been well studied [2]. Another reason for resource underutilization, however, is TB-level resource management. Shared memory multiplexing [26] targets shared memory management and is complementary to our proposed WarpMan scheme.…”
Section: Related Work
confidence: 99%