2012
DOI: 10.1145/2166879.2166882

A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

Abstract: Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and l…
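The abstract's two-level scheme can be pictured as a small active pool of issue candidates backed by a larger pending pool. The C++ sketch below is only an illustration of that idea under simplified assumptions; the `Warp`, `TwoLevelScheduler`, `select`, and `demote` names are hypothetical and do not come from the paper.

```cpp
// Minimal sketch of a two-level warp scheduler, under simplified assumptions:
// a small active pool supplies issue candidates each cycle, and a warp that
// stalls on a long-latency event is swapped out for a ready pending warp.
// All names are illustrative, not from the cited paper.
#include <cstdio>
#include <deque>
#include <vector>

struct Warp {
    int id;
    bool long_latency_stall = false;  // e.g. waiting on a global memory load
};

class TwoLevelScheduler {
public:
    TwoLevelScheduler(int num_warps, int active_slots) {
        for (int i = 0; i < num_warps; ++i) {
            if ((int)active_.size() < active_slots) active_.push_back(Warp{i});
            else pending_.push_back(Warp{i});
        }
    }

    // Round-robin over the small active pool only, so per-cycle selection
    // logic scales with the active-pool size rather than the total warp count.
    Warp* select() {
        if (active_.empty()) return nullptr;
        for (size_t tries = 0; tries < active_.size(); ++tries) {
            Warp& w = active_[rr_];
            rr_ = (rr_ + 1) % active_.size();
            if (!w.long_latency_stall) return &w;
        }
        return nullptr;  // every active warp is stalled this cycle
    }

    // Demote a warp that just issued a long-latency operation and promote a
    // pending warp so the active pool stays full.
    void demote(int warp_id) {
        for (size_t i = 0; i < active_.size(); ++i) {
            if (active_[i].id == warp_id) {
                pending_.push_back(active_[i]);
                active_.erase(active_.begin() + i);
                break;
            }
        }
        if (!pending_.empty()) {
            active_.push_back(pending_.front());
            pending_.pop_front();
        }
        if (rr_ >= active_.size()) rr_ = 0;
    }

private:
    std::vector<Warp> active_;   // small set: hides ALU/short latencies
    std::deque<Warp> pending_;   // large set: hides long memory latencies
    size_t rr_ = 0;              // round-robin pointer within the active pool
};

int main() {
    TwoLevelScheduler sched(/*num_warps=*/32, /*active_slots=*/6);
    if (Warp* w = sched.select()) {
        std::printf("issue from warp %d\n", w->id);
        sched.demote(w->id);  // pretend it issued a long-latency load
    }
    return 0;
}
```

Keeping the per-cycle candidate set small is what lets the selection logic, and the register accesses it triggers, stay cheap while the larger pending pool still covers long memory latencies.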

Cited by 32 publications (17 citation statements)
References 50 publications
“…While this work focused on modeling of concurrent execution, it did not discuss how to achieve the desired concurrency level. Other relevant GPU work includes topics such as GPU exception handling [18], where register state must be restored to resume execution after an exception; energy saving [10], where register placement is critical because the distance between registers and processors determines the energy consumed by data movement; and hardware register space saving [29], which combines SRAM and DRAM to store more bits in the same die area. In [28], a means of optimizing shared memory is explored in order to prevent user-allocated shared memory from reducing occupancy, whereas our approach makes use of non-user-allocated shared memory to lessen the cost of improving occupancy.…”
Section: Related Work
confidence: 99%
“…This is the latest proposal on register files for area and energy efficiency [20], and researchers at NVIDIA adopted the idea for their GPUs [21], [22].…”
Section: NORCS [20]-[22]
confidence: 99%
“…The commonly used Loose Round Robin (LRR) scheduler [17] treats all warps as instruction-issue candidates to maximize TLP. Previous work [16,30] has shown that LRR cannot efficiently hide long-latency operations and causes high resource contention.…”
Section: GPGPU Schedulers
confidence: 99%
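For contrast with the excerpt above, a loose round-robin policy considers every warp as a candidate each cycle rather than a small active subset. The short C++ sketch below is only an illustrative model under a simplified ready/not-ready assumption; the `lrr_pick` helper and `WarpSlot` type are hypothetical, not an API from the cited work.

```cpp
// Illustrative loose round-robin (LRR) selection: every warp is an issue
// candidate, so the rotation sweeps over the full warp list each cycle.
#include <cstdio>
#include <vector>

struct WarpSlot { bool ready; };  // assumed per-warp ready flag

// Return the index of the next ready warp after `last`, or -1 if none.
int lrr_pick(const std::vector<WarpSlot>& warps, int last) {
    int n = static_cast<int>(warps.size());
    for (int k = 1; k <= n; ++k) {
        int idx = (last + k) % n;
        if (warps[idx].ready) return idx;
    }
    return -1;
}

int main() {
    std::vector<WarpSlot> warps = {{false}, {true}, {true}, {false}};
    int picked = lrr_pick(warps, /*last=*/0);
    std::printf("picked warp %d\n", picked);  // prints 1
    return 0;
}
```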
“…Existing warp scheduling schemes such as LRR [17] and TL [16,30] try either to maximize TLP or to select a universal TLP parameter. Previous work [16,23] shows that maximized or universal TLP does not always deliver optimal performance, because no single setting fits the varied access patterns of diverse applications.…”
Section: Introduction
confidence: 99%