Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture 2011
DOI: 10.1145/2155620.2155676

SIMD re-convergence at thread frontiers

Abstract: Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA, OpenCL, and DirectX Compute. The impact of branch divergence can be quite different depending upon whether the program's control flow is structured or unstructured. In this paper, we show that unstructured control flow occurs frequently in applications and can lead to significant code expansion when executed using…

Cited by 66 publications (62 citation statements)
References: 19 publications
“…There are two common ways to maintain the logical PC of each thread. The first, used by the GPU in Intel's Sandy Bridge [7,16], maintains a separate PC for each thread and masks out threads that do not match the current per-warp PC. NVIDIA and AMD GPUs use an alternate mechanism in which the active masks are stored on a reconvergence stack, which we explain below.…”
Section: Divergent Control Flow (mentioning)
confidence: 99%
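The per-thread PC scheme described in the statement above can be illustrated with a small sketch. This is not taken from the cited papers: the function names (`active_mask`, `step`, `demo_next_pc`) and the toy control-flow graph are hypothetical, and choosing the lowest live PC as the warp PC is an assumption, since the statement does not say how the warp PC is selected.

```python
def active_mask(thread_pcs, warp_pc):
    """Return one boolean per thread: True if its PC matches the warp PC."""
    return [pc == warp_pc for pc in thread_pcs]


def step(thread_pcs, next_pc):
    """Run one warp-level step over per-thread program counters.

    thread_pcs: list of PCs, one per thread (None means the thread has exited).
    next_pc: hypothetical callback (thread_id, pc) -> next PC, or None on exit.
    Returns (new_thread_pcs, warp_pc, mask).
    """
    live = [pc for pc in thread_pcs if pc is not None]
    if not live:
        return thread_pcs, None, []
    # Assumed policy: run the lowest live PC first (not stated in the source).
    warp_pc = min(live)
    mask = active_mask(thread_pcs, warp_pc)
    # Only threads whose PC matches the warp PC advance; the rest are masked out.
    new_pcs = [
        next_pc(t, pc) if active else pc
        for t, (pc, active) in enumerate(zip(thread_pcs, mask))
    ]
    return new_pcs, warp_pc, mask


def demo_next_pc(thread_id, pc):
    """Toy control flow: block 0 branches to 1 or 2, both join at 3, then exit."""
    if pc == 0:
        return 1 if thread_id % 2 == 0 else 2
    if pc in (1, 2):
        return 3
    return None


if __name__ == "__main__":
    pcs = [0, 0, 0, 0]
    while any(pc is not None for pc in pcs):
        pcs, warp_pc, mask = step(pcs, demo_next_pc)
        print(warp_pc, mask)
```

In this toy run the four threads diverge after block 0, execute blocks 1 and 2 under complementary masks, and are all active again at block 3, without any explicit reconvergence stack.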
“…This work was extended to dynamic warp subdivision [18], which allows warp subsets to be scheduled independently to enhance latency tolerance. Diamos et al [7] propose Thread Frontier as an alternative to the immediate post dominator reconvergence algorithm. Thread frontiers use the earliest reconvergence point possible in an unstructured control flow [31].…”
Section: Related Work (mentioning)
confidence: 99%
“…They do not support exceptions or interruptions, which prevents their use with a general-purpose system software stack. Various works extend the SIMT model to support more generic code [22], [23] or more flexible execution [24], [25], [26]. However, they all target applications specifically written for GPUs, rather than general-purpose parallel applications.…”
Section: E. Power and Energy (mentioning)
confidence: 99%
“…It is therefore not strictly necessary to resort to a mechanism such as the one used by NVIDIA in order to execute arbitrary code. However, the technique based on jumps and annotations avoids static code duplication by replacing it with dynamic duplication (Diamos et al., 2011). The mechanism used by Tesla can also be extended to certain indirect jumps, as the Fermi architecture proposes (Nickolls, Dally, 2010).…”
Section: Nvidia Tesla (unclassified)
“…Diamos and his coauthors formalize this approach by presenting an algorithm that computes the optimal ordering of basic blocks, and they propose a software implementation (Diamos et al., 2011).…”
Section: Lorie-strong (unclassified)