2012
DOI: 10.1145/2166879.2166882

A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

Abstract: Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and l…
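The abstract's two-level scheme can be pictured as a small active pool of issue candidates backed by a larger pending pool. The C++ sketch below is only an illustration of that idea under simplified assumptions; the `Warp`, `TwoLevelScheduler`, `select`, and `demote` names are hypothetical and do not come from the paper.

```cpp
// Minimal sketch of a two-level warp scheduler, under simplified assumptions:
// a small active pool supplies issue candidates each cycle, and a warp that
// stalls on a long-latency event is swapped out for a ready pending warp.
// All names are illustrative, not from the cited paper.
#include <cstdio>
#include <deque>
#include <vector>

struct Warp {
    int id;
    bool long_latency_stall = false;  // e.g. waiting on a global memory load
};

class TwoLevelScheduler {
public:
    TwoLevelScheduler(int num_warps, int active_slots) {
        for (int i = 0; i < num_warps; ++i) {
            if ((int)active_.size() < active_slots) active_.push_back(Warp{i});
            else pending_.push_back(Warp{i});
        }
    }

    // Round-robin over the small active pool only, so per-cycle selection
    // logic scales with the active-pool size rather than the total warp count.
    Warp* select() {
        if (active_.empty()) return nullptr;
        for (size_t tries = 0; tries < active_.size(); ++tries) {
            Warp& w = active_[rr_];
            rr_ = (rr_ + 1) % active_.size();
            if (!w.long_latency_stall) return &w;
        }
        return nullptr;  // every active warp is stalled this cycle
    }

    // Demote a warp that just issued a long-latency operation and promote a
    // pending warp so the active pool stays full.
    void demote(int warp_id) {
        for (size_t i = 0; i < active_.size(); ++i) {
            if (active_[i].id == warp_id) {
                pending_.push_back(active_[i]);
                active_.erase(active_.begin() + i);
                break;
            }
        }
        if (!pending_.empty()) {
            active_.push_back(pending_.front());
            pending_.pop_front();
        }
        if (rr_ >= active_.size()) rr_ = 0;
    }

private:
    std::vector<Warp> active_;   // small set: hides ALU/short latencies
    std::deque<Warp> pending_;   // large set: hides long memory latencies
    size_t rr_ = 0;              // round-robin pointer within the active pool
};

int main() {
    TwoLevelScheduler sched(/*num_warps=*/32, /*active_slots=*/6);
    if (Warp* w = sched.select()) {
        std::printf("issue from warp %d\n", w->id);
        sched.demote(w->id);  // pretend it issued a long-latency load
    }
    return 0;
}
```

Keeping the per-cycle candidate set small is what lets the selection logic, and the register accesses it triggers, stay cheap while the larger pending pool still covers long memory latencies.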

Cited by 32 publications (17 citation statements)
References 50 publications
“…While this work focused on modeling of concurrent execution, it did not discuss how to achieve the desired concurrency level. Other relevant GPU work includes topics such as GPU exception handling [18], where register state must be restored to resume execution after an exception; energy saving [10], where register placement is critical because the distance between registers and processors determines the energy consumed by data movement; and hardware register space saving [29], which combines SRAM and DRAM to store more bits in the same die area. In [28], a means of optimizing shared memory is explored in order to prevent user-allocated shared memory from reducing occupancy, whereas our approach makes use of non-user-allocated shared memory to lessen the cost of improving occupancy.…”
Section: Related Work
confidence: 99%
“…This is the latest proposal on register files for area and energy efficiency [20], and researchers at NVIDIA adopted the idea for their GPUs [21], [22].…”
Section: NORCS [20]-[22]
confidence: 99%
“…The commonly used Loose Round Robin (LRR) scheduler [17] treats all warps as instruction-issue candidates to maximize TLP. Previous work [16,30] has shown that LRR cannot efficiently hide long-latency operations and causes high resource contention.…”
Section: GPGPU Schedulers
confidence: 99%
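For contrast with the excerpt above, a loose round-robin policy considers every warp as a candidate each cycle rather than a small active subset. The short C++ sketch below is only an illustrative model under a simplified ready/not-ready assumption; the `lrr_pick` helper and `WarpSlot` type are hypothetical, not an API from the cited work.

```cpp
// Illustrative loose round-robin (LRR) selection: every warp is an issue
// candidate, so the rotation sweeps over the full warp list each cycle.
#include <cstdio>
#include <vector>

struct WarpSlot { bool ready; };  // assumed per-warp ready flag

// Return the index of the next ready warp after `last`, or -1 if none.
int lrr_pick(const std::vector<WarpSlot>& warps, int last) {
    int n = static_cast<int>(warps.size());
    for (int k = 1; k <= n; ++k) {
        int idx = (last + k) % n;
        if (warps[idx].ready) return idx;
    }
    return -1;
}

int main() {
    std::vector<WarpSlot> warps = {{false}, {true}, {true}, {false}};
    int picked = lrr_pick(warps, /*last=*/0);
    std::printf("picked warp %d\n", picked);  // prints 1
    return 0;
}
```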
“…Existing warp scheduling schemes such as LRR [17] and TL [16,30] try either to maximize TLP or to select a universal TLP parameter. Previous work [16,23] shows that maximized or universal TLP does not always deliver optimal performance, because no single setting fits the varied access patterns of diverse applications.…”
Section: Introduction
confidence: 99%