2011
DOI: 10.1145/2024723.2000093
Energy-efficient mechanisms for managing thread context in throughput processors

Abstract: Modern graphics processing units (GPUs) use a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complicated thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we examine register file caching to replace accesses to the large main register file with accesses to a s…
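The register file caching idea in the abstract — a small structure that absorbs accesses before they reach the large, expensive main register file — can be illustrated with a minimal sketch. The class name, LRU replacement policy, and capacity below are illustrative assumptions, not the paper's actual design:

```python
# Hedged sketch of a register file cache (RFC) in front of a large main
# register file (MRF). LRU replacement and the capacity are assumptions
# for illustration; the paper's design may differ.
from collections import OrderedDict

class RegisterFileCache:
    def __init__(self, capacity=6):
        self.cache = OrderedDict()   # register id -> value, LRU order
        self.capacity = capacity
        self.rfc_hits = 0            # accesses served by the small RFC
        self.mrf_accesses = 0        # accesses that reach the large MRF

    def read(self, reg):
        if reg in self.cache:
            self.rfc_hits += 1
            self.cache.move_to_end(reg)      # refresh LRU position
            return self.cache[reg]
        self.mrf_accesses += 1               # miss: pay the expensive MRF read
        value = 0                            # stand-in for the MRF lookup
        self._fill(reg, value)
        return value

    def write(self, reg, value):
        # Results land in the RFC first; evictions write back to the MRF.
        self._fill(reg, value)

    def _fill(self, reg, value):
        if reg in self.cache:
            self.cache.move_to_end(reg)
        elif len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)   # evict LRU entry to the MRF
            self.mrf_accesses += 1           # write-back cost
        self.cache[reg] = value
```

The energy argument is that `rfc_hits` replace accesses to the large structure with accesses to a much smaller, cheaper one, so only `mrf_accesses` pay the full cost.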

Cited by 65 publications (72 citation statements). References 26 publications.
“…It only induces 1.5% performance overhead based on our evaluation across a large set of GPGPU benchmarks (detailed experimental methodologies are described in Section 4.1), which also matches the observation made in [4]. During the register renaming stage, the destination register ID is renamed to a free physical register.…”
Section: Memory Contention-aware TFET Register Allocation (supporting)
confidence: 79%
“…The output is written back to the counter; it will be read when the warp enters the pipeline, and a larger-than-one value in the counter implies the necessity of writing to the TFET-based register. In [4], Gebhart et al. found that 70% of the register values are read only once in GPGPU workloads. This implies that most TFET register values are read once; therefore, renaming the destination register to the TFET register usually causes a 2-cycle extra delay: one additional cycle during the value write-back, and another when it is read by a subsequent instruction.…”
Section: Methods (mentioning)
confidence: 99%
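The "70% of register values are read only once" observation cited above can be measured from an instruction trace. A minimal sketch, assuming an illustrative trace format of `(dest_reg, [src_regs])` tuples (not any paper's actual trace format):

```python
# Hedged sketch: fraction of produced register values read exactly once.
# The trace format (dest_reg, [src_regs]) is an illustrative assumption.
from collections import defaultdict

def read_once_fraction(trace):
    """trace: list of (dest_reg, [src_regs]) per instruction, in order."""
    reads = defaultdict(int)   # reads since the last write to each register
    read_counts = []           # read count of each value when it is killed
    for dest, srcs in trace:
        for s in srcs:
            reads[s] += 1
        if dest in reads:
            read_counts.append(reads[dest])  # old value killed by overwrite
        reads[dest] = 0                      # new value starts with 0 reads
    read_counts.extend(reads.values())       # values still live at the end
    return sum(1 for c in read_counts if c == 1) / len(read_counts)
```

A high read-once fraction is what makes a small register file cache or a slower (e.g. TFET-based) backing store attractive: most values need only one cheap read before they die.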
“…Another reason for warp-level divergence is warp scheduling policies, which may prioritize some warps over others in a TB. For example, the recently proposed two-level scheduling [9][17] tries to better overlap memory access latency with computation by intentionally making some warps run somewhat faster than others. In Figure 12, we compare the impact of two scheduling policies, round robin (labeled 'RR') and two-level (labeled '2L').…”
Section: Program-dependent Workload Imbalance (mentioning)
confidence: 99%
“…The warp scheduling policy considered selects the oldest ready instruction. We assume a bandwidth-limited memory with a fixed latency, following the methodology of (Gebhart et al., 2011). We calibrate the model parameters against microbenchmark results …”
Section: Simulation Methodology (unclassified)