SIMD re-convergence at thread frontiers

Diamos, Gregory; Ashbaugh, Benjamin; Maiyuran, Subramaniam; Kerr, Andrew; Wu, Hong; Yalamanchili, Sudhakar

doi:10.1145/2155620.2155676

Cited by 66 publications

(62 citation statements)

References 19 publications

“…There are two common ways of maintaining the logical PC of each thread. The first, used by the GPU in Intel's Sandy Bridge [7,16], maintains a separate PC for each thread and masks out threads that do not match the current per-warp PC. NVIDIA and AMD GPUs use an alternate mechanism in which the active masks are stored on a reconvergence stack, which we explain below.…”

Section: Divergent Control Flow (mentioning)

confidence: 99%
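The per-thread PC scheme described in the statement above can be illustrated with a toy lockstep simulator. This is a minimal sketch under stated assumptions, not the Sandy Bridge mechanism itself: the instruction encoding, the policy of choosing the smallest pending PC as the warp PC, and every name in the code are hypothetical.

```python
# Toy model of per-thread program counters (PCs) with a per-warp PC.
# Assumptions (not from the cited papers): the instruction encoding,
# the min-PC scheduling policy, and all names below are hypothetical.

def run_warp(program, num_threads, thread_data):
    """Run all threads in lockstep; return (issue trace, final registers)."""
    pcs = [0] * num_threads          # one logical PC per thread
    regs = list(thread_data)         # one register per thread
    trace = []                       # (warp_pc, active_mask) per issued step
    end = len(program)
    while any(pc < end for pc in pcs):
        # Pick the warp PC: here, the smallest PC among unfinished threads.
        warp_pc = min(pc for pc in pcs if pc < end)
        # Threads whose PC differs from the warp PC are masked out.
        mask = [pc == warp_pc for pc in pcs]
        trace.append((warp_pc, mask))
        op, arg = program[warp_pc]
        for t in range(num_threads):
            if not mask[t]:
                continue
            if op == "add":
                regs[t] += arg
                pcs[t] += 1
            elif op == "br_if_odd":  # branch to arg if the register is odd
                pcs[t] = arg if regs[t] % 2 else pcs[t] + 1
            elif op == "jmp":
                pcs[t] = arg
    return trace, regs

# if (x is odd) x += 100; else x += 10;  then x += 1
PROGRAM = [
    ("br_if_odd", 3),   # 0: divergent branch
    ("add", 10),        # 1: else path
    ("jmp", 4),         # 2
    ("add", 100),       # 3: then path
    ("add", 1),         # 4: join block
]

trace, regs = run_warp(PROGRAM, 4, [0, 1, 2, 3])
for warp_pc, mask in trace:
    print(warp_pc, mask)
```

In this toy run the two paths execute with complementary masks, and all four threads are active again at the join block because their PCs match the warp PC once more; no explicit reconvergence stack is involved.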

“…This work was extended to dynamic warp subdivision [18], which allows warp subsets to be scheduled independently to enhance latency tolerance. Diamos et al. [7] propose Thread Frontier as an alternative to the immediate post-dominator reconvergence algorithm. Thread frontiers use the earliest reconvergence point possible in unstructured control flow [31].…”

Section: Related Work (mentioning)

confidence: 99%
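To see why the immediate post-dominator can be a late reconvergence point in unstructured control flow, the following sketch computes post-dominators for a small hypothetical control-flow graph. The graph, the node names, and the helper functions are illustrative assumptions; the code shows only the standard iterative post-dominator computation, not the thread-frontier algorithm.

```python
# Hypothetical CFG and helper names, for illustration only.

def post_dominators(succ, exit_node):
    """Iteratively compute the post-dominator set of every node."""
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}   # start from "everything"
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # n is post-dominated by itself plus whatever post-dominates
            # all of its successors.
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom

def immediate_post_dominator(pdom, n):
    """Return the closest strict post-dominator of n, or None for the exit."""
    strict = pdom[n] - {n}
    for c in strict:
        # The closest one is post-dominated by all the other strict ones.
        if strict <= pdom[c]:
            return c
    return None

# Short-circuit style graph: both A and B can jump into C.
CFG = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": [],
}

pdom = post_dominators(CFG, "D")
print({n: immediate_post_dominator(pdom, n) for n in CFG})
```

In this graph, block C is reachable from both A and B, yet the immediate post-dominator of A is D. A scheme that only reconverges at immediate post-dominators would therefore run C once for the threads arriving from A and again for those arriving from B, which is the kind of case the earlier reconvergence point in the statement above is meant to address.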

CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures

Rhu, Erez

2012

2012 39th Annual International Symposium on Computer Architecture (ISCA)

Wide SIMD-based GPUs have evolved into a promising platform for running general purpose workloads. Current programmable GPUs allow even code with irregular control to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread where hardware ensures that control flow is accurate by automatically applying masked execution. The masked execution, however, often degrades performance because the issue slots of masked lanes are wasted. This degradation can be mitigated by dynamically compacting multiple unmasked threads into a single SIMD unit. This paper proposes a fundamentally new approach to branch compaction that avoids the unnecessary synchronization required by previous techniques and that only stalls threads that are likely to benefit from compaction. Our technique is based on the compaction-adequacy predictor (CAPRI). CAPRI dynamically identifies the compaction-effectiveness of a branch and only stalls threads that are predicted to benefit from compaction. We utilize a simple single-level branch-predictor inspired structure and show that this simple configuration attains a prediction accuracy of 99.8% and 86.6% for non-divergent and divergent workloads, respectively. Our performance evaluation demonstrates that CAPRI consistently outperforms both the baseline design that never attempts compaction and prior work that stalls upon all divergent branches.
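The idea of predicting, per branch, whether compaction is worth stalling for can be sketched as a PC-indexed table with one bit of history. This is a loose illustration rather than the CAPRI design: the adequacy test (threads may change warps but not lanes), the default prediction for unseen branches, and all names are assumptions not taken from the abstract.

```python
# Hypothetical sketch: a PC-indexed, one-bit "did compaction help last time?"
# table. The adequacy test and the default prediction are assumptions made
# for illustration, not details taken from the CAPRI paper.

def warps_after_compaction(masks):
    """Minimum warps needed if threads may switch warps but not lanes."""
    lanes = len(masks[0])
    return max(sum(m[lane] for m in masks) for lane in range(lanes))

def compaction_helps(masks):
    """True if compaction needs fewer warps than currently hold active threads."""
    occupied = sum(1 for m in masks if any(m))
    return warps_after_compaction(masks) < occupied

class AdequacyPredictor:
    def __init__(self):
        self.table = {}   # branch PC -> last observed outcome (bool)

    def predict(self, branch_pc):
        # Assumed default: stall for compaction on a branch not seen before.
        return self.table.get(branch_pc, True)

    def update(self, branch_pc, masks):
        # Record whether compaction would have helped on this execution.
        self.table[branch_pc] = compaction_helps(masks)

predictor = AdequacyPredictor()

# Active masks of two warps on one side of a branch (1 = active lane).
conflicting = [[1, 1, 0, 0], [1, 1, 0, 0]]      # same lanes busy in both warps
complementary = [[1, 0, 1, 0], [0, 1, 0, 1]]    # lanes interleave perfectly

predictor.update(0x40, conflicting)
predictor.update(0x80, complementary)
print(predictor.predict(0x40), predictor.predict(0x80))
```

With the conflicting masks, both warps keep threads in the same lanes, so merging cannot reduce the warp count and the sketch learns not to stall at that branch; with the complementary masks, two half-empty warps fit into one.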

“…They do not support exceptions or interruptions, which prevents their use with a general-purpose system software stack. Various works extend the SIMT model to support more generic code [22], [23] or more flexible execution [24], [25], [26]. However, they all target applications specifically written for GPUs, rather than general-purpose parallel applications.…”

Section: E Power and Energy (mentioning)

confidence: 99%

Dynamic Inter-Thread Vectorization Architecture: Extracting DLP from TLP

Kalathingal, Collange, Swamy, et al.

2016

2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

Abstract: Threads of Single-Program Multiple-Data (SPMD) applications often execute the same instructions on different data. We propose the Dynamic Inter-Thread Vectorization Architecture (DITVA) to leverage this implicit data-level parallelism in SPMD applications by assembling dynamic vector instructions at runtime. DITVA extends an SIMD-enabled in-order SMT processor with an inter-thread vectorization execution mode. In this mode, multiple scalar threads running in lockstep share a single instruction stream and their respective instruction instances are aggregated into SIMD instructions. To balance thread- and data-level parallelism, threads are statically grouped into fixed-size independently scheduled warps. DITVA leverages existing SIMD units and maintains binary compatibility with existing CPU architectures. Our evaluation on the SPMD applications from the PARSEC and Rodinia OpenMP benchmarks shows that a 4-warp × 4-lane 4-issue DITVA architecture with a realistic bank-interleaved cache achieves 1.55× higher performance than a 4-thread 4-issue SMT architecture with AVX instructions while fetching and issuing 51% fewer instructions, achieving an overall 24% energy reduction.

“…It is therefore not strictly necessary to resort to a mechanism such as the one used by NVIDIA in order to execute arbitrary code. However, the technique based on jumps and annotations avoids static code duplication by replacing it with dynamic duplication (Diamos et al., 2011). The mechanism used by Tesla can also be extended to certain indirect jumps, as the Fermi architecture proposes (Nickolls, Dally, 2010).…”

Section: Nvidia Tesla (unclassified)

“…Diamos and his coauthors formalize this approach by presenting an algorithm for computing the optimal ordering of basic blocks, and they propose a software implementation (Diamos et al., 2011).…”

Section: Lorie-strong (unclassified)

Reconvergence de contrôle implicite pour les architectures SIMT

Brunie, Collange

2013

Techniques et sciences informatiques

ABSTRACT. Parallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose two techniques to handle SIMT control divergence and identify reconvergence points. The most advanced one operates in constant space and handles indirect jumps and recursion. We evaluate a hardware implementation which leverages the existing memory divergence management unit. In terms of performance, this solution is at least as efficient as state-of-the-art techniques in use in current GPUs.

KEYWORDS: control-flow reconvergence, SIMD, SIMT, GPU
