Energy-efficient mechanisms for managing thread context in throughput processors

Gebhart, Mark; Johnson, D.; Tarjan, David; Keckler, Stephen W.; Dally, William J.; Lindholm, Erik; Skadron, Kevin

doi:10.1145/2000064.2000093

Cited by 214 publications

(105 citation statements)

References 30 publications

“…Several researchers have proposed a variety of schedulers that preferentially schedule out of a small pool of warps [15], [16]. These two-level schedulers have been developed for a number of reasons, but all of them generally have the effect of reducing contention in the caches and memory subsystem by limiting the number of co-scheduled warps.…”

Section: Related Work (mentioning)

confidence: 99%

Priority-based cache allocation in throughput processors

Liu, Rhu, Johnson, et al. 2015

2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)

Self Cite

GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can, however, increase contention for various system resources, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, however, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth. This paper proposes to tightly couple the thread scheduling mechanism with the cache management algorithms such that GPU cache pollution is minimized while off-chip memory throughput is enhanced. We propose priority-based cache allocation (PCAL) that provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower priority threads to execute without contending for the cache. By tuning thread-level parallelism while optimizing both caching efficiency and other shared resource usage, PCAL builds upon previous thread throttling approaches, improving overall performance by an average of 17% and a maximum of 51%.
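
The idea in the abstract above, that only a subset of high-priority threads may allocate cache capacity while the rest run without contending for it, can be illustrated with a toy model. This is a minimal sketch and not the PCAL design from the paper: the LRU cache, the `token_holders` set, the `simulate_cache` function, and the access trace are all assumptions made for illustration, and warps without a token are assumed to be served from memory without filling the cache.

```python
from collections import OrderedDict

def simulate_cache(accesses, capacity, token_holders=None):
    """Count hits in a tiny LRU cache (hypothetical model, not real hardware).

    accesses: iterable of (warp_id, address) pairs in issue order.
    token_holders: set of warp ids allowed to allocate a line on a miss;
        None means every warp may allocate (no prioritization).
    """
    cache = OrderedDict()  # keys are addresses, ordered from LRU to MRU
    hits = 0
    for warp_id, address in accesses:
        if address in cache:
            hits += 1
            cache.move_to_end(address)  # refresh the LRU position
            continue
        may_allocate = token_holders is None or warp_id in token_holders
        if not may_allocate:
            continue  # assumed: miss is serviced without filling the cache
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used line
        cache[address] = None
    return hits

# Warp 0 reuses two lines; warp 1 streams through lines it never reuses.
trace = []
for i in range(4):
    trace += [(0, "a"), (0, "b"), (1, f"s{2 * i}"), (1, f"s{2 * i + 1}")]

baseline = simulate_cache(trace, capacity=2)                        # all warps allocate
prioritized = simulate_cache(trace, capacity=2, token_holders={0})  # only warp 0 allocates
print(baseline, prioritized)
```

In this trace the streaming warp evicts warp 0's lines whenever it is allowed to allocate, so restricting allocation to warp 0 keeps the reused lines resident while warp 1 continues to issue requests.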

“…Later, Gebhart et al [26] used a two-level warp scheduling technique so as to reduce the consumption of energy. The researchers noticed that the written registers are often read last within three instructions after they are written.…”

Section: Mechanism(tl W) (mentioning)

confidence: 99%

A survey of techniques for warp scheduling in GPUs

Sandokji, Essa, Fadel 2015

2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS)

The heterogeneous nature of graphics processing unit (GPU)-CPU systems makes them candidates for coming exascale systems. The cores of a GPGPU, which is a cost-effective computing platform, are characterized by long periods of inactivity, which results in underutilization of hardware resources. This is due to several factors, such as the limited on-chip memory and register files, inefficient scheduling mechanisms, and GPU-CPU communication bottlenecks. To counteract the underutilization of resources, certain techniques have been proposed. In this research, many architectural and system-level techniques aiming to manage and fully leverage GPU resources are surveyed, compared, and evaluated. The significance and challenges of the warp scheduler in GPUs are also thoroughly discussed. The main purpose of this paper is to give researchers insight into warp scheduling techniques for GPUs and to motivate them to present more efficient methods for enhancing performance by improving the thread scheduler in future GPUs.

“…We have already provided quantitative comparisons of our proposal with the two-level scheduler. Gebhart and Johnson et al [12] propose a two-level warp scheduling technique that aims to reduce energy consumption in GPUs. Jog et al [19] propose OWL, a series of CTA-aware warp scheduling techniques to reduce cache contention and improve DRAM performance for bandwidth-limited GPGPU applications.…”

Section: Related Work (mentioning)

confidence: 99%

Reshaping cache misses to improve row-buffer locality in multicore systems

Kayıran, Jog, Kandemir, et al. 2013

Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques

General-purpose graphics processing units (GPGPUs) are at their best in accelerating computation by exploiting abundant thread-level parallelism (TLP) offered by many classes of HPC applications. To facilitate such high TLP, emerging programming models like CUDA and OpenCL allow programmers to create work abstractions in terms of smaller work units, called cooperative thread arrays (CTAs). CTAs are groups of threads and can be executed in any order, thereby providing ample opportunities for TLP. The state-of-the-art GPGPU schedulers allocate the maximum possible number of CTAs per core (limited by available on-chip resources) to enhance performance by exploiting TLP. However, we demonstrate in this paper that executing the maximum possible number of CTAs on a core is not always the optimal choice from the performance perspective. A high number of concurrently executing threads might cause more memory requests to be issued and create contention in the caches, network, and memory, leading to long stalls at the cores. To reduce resource contention, we propose a dynamic CTA scheduling mechanism, called DYNCTA, which modulates the TLP by allocating an optimal number of CTAs, based on application characteristics. To minimize resource contention, DYNCTA allocates fewer CTAs for applications suffering from high contention in the memory subsystem, compared to applications demonstrating high throughput. Simulation results on a 30-core GPGPU platform with 31 applications show that the proposed CTA scheduler provides 28% average improvement in performance compared to the existing CTA scheduler.
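
The abstract above describes modulating the number of CTAs per core according to memory contention. A minimal sketch of such a feedback rule follows. It is not DYNCTA's actual monitoring logic: the function `next_cta_limit`, the single `mem_stall_fraction` input, and the `high` and `low` thresholds are assumptions chosen only to show the shape of the control loop.

```python
def next_cta_limit(current, mem_stall_fraction, max_ctas,
                   high=0.6, low=0.3, min_ctas=1):
    """Return the CTA limit for the next sampling period (hypothetical rule).

    mem_stall_fraction: assumed fraction of cycles the core spent stalled
        on memory during the last sampling period, in [0, 1].
    """
    if mem_stall_fraction > high:
        # Heavy memory contention: allow one fewer CTA on this core.
        return max(min_ctas, current - 1)
    if mem_stall_fraction < low:
        # Memory system has headroom: allow one more CTA.
        return min(max_ctas, current + 1)
    return current  # in between: leave the limit unchanged

limit = 8
history = []
for stall in [0.8, 0.7, 0.5, 0.2]:
    limit = next_cta_limit(limit, stall, max_ctas=8)
    history.append(limit)
print(history)
```

The dead band between `low` and `high` keeps the limit from oscillating when the measured stall fraction hovers near a single threshold.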
