Adaptive Cache Management for Energy-Efficient GPU Computing

Chen, Xuhao; Chang, Li-Wen; Rodrigues, Christopher; Lv, Jie; Wang, Zhiying; Hwu, Wen-mei W.

doi:10.1109/micro.2014.11

Cited by 146 publications

(79 citation statements)

References 36 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Chen et al [20] designed a hardware sampling based method on GPUs for LI data cache bypassing and used warp throttling to reduce contention. Tian et al [21] implemented a PC-based dynamic GPU cache bypassing predictor.…”

Section: Related Workmentioning

confidence: 99%

Enhancing GPU Performance by Efficient Hardware-Based and Hybrid L1 Data Cache Bypassing

Huangfu¹,

Zhang²

2017

Journal of Computing Science and Engineering

View full text Add to dashboard Cite

Recent GPUs have adopted cache memory to benefit general-purpose GPU (GPGPU) programs. However, unlike CPU programs, GPGPU programs typically have considerably less temporal/spatial locality. Moreover, the L1 data cache is used by many threads that access a data size typically considerably larger than the L1 cache, making it critical to bypass L1 data cache intelligently to enhance GPU cache performance. In this paper, we examine GPU cache access behavior and propose a simple hardware-based GPU cache bypassing method that can be applied to GPU applications without recompiling programs. Moreover, we introduce a hybrid method that integrates static profiling information and hardware-based bypassing to further enhance performance. Our experimental results reveal that hardware-based cache bypassing can boost performance for most benchmarks, and the hybrid method can achieve performance comparable to state-of-the-art compiler-based bypassing with considerably less profiling cost.

show abstract

Section: Related Workmentioning

confidence: 99%

Enhancing GPU Performance by Efficient Hardware-Based and Hybrid L1 Data Cache Bypassing

Huangfu¹,

Zhang²

2017

Journal of Computing Science and Engineering

View full text Add to dashboard Cite

show abstract

“…The CPU-based approaches are usually designed for last level caches (LLCs), where data locality is already filtered by previous level(s) of caches. But the poor locality of GPU workloads and resource congestion impose difficulty for them to make robust predictions and they often increase L2 and DRAM level traffic [11] (Section 6.1(a)). GPU-based bypassing schemes are generally conditional/reactive bypassing (e.g., bypass upon unavailable resources [15] or coarse-grained bypassing on warps or threadblocks [30,31,27]) which can incorrectly bypass accesses with good reuse and cause memory pipeline stalls (Section 6.1(a)).…”

Section: Introductionmentioning

confidence: 99%

“…Moreover, a fully-adaptive bypassing scheme is required to maintain the efficiency of workloads with good caching behavior, which is often neglected by previous approaches [24,15,12,11] (Section 6.1(b)).…”

Section: Introductionmentioning

confidence: 99%

Locality-Driven Dynamic GPU Cache Bypassing

Song

Dai

et al. 2015

Proceedings of the 29th ACM on International Conference on Supercomputing

103

View full text Add to dashboard Cite

This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from singleinstruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that the memory access streams to L1 D-caches for many applications contain a significant amount of requests with low reuse, which greatly reduce the cache efficacy. Existing GPU cache management schemes are either based on conditional/reactive solutions or hit-rate based designs specifically developed for CPU last level caches, which can limit overall performance.To overcome these challenges, we propose an efficient locality monitoring mechanism to dynamically filter the access stream on cache insertion such that only the data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design can dramatically reduce cache contention and achieve up to 56.8% and an average of 30.3% performance improvement over the baseline architecture, for a range of highly-optimized cache-unfriendly applications with minor area overhead and better energy efficiency. Our design also significantly outperforms the state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra L2 and DRAM level contention.

show abstract

“…There is limited communication between different workgroups. Since GPU applications generally exhibit little L1 temporal locality [9], the communication between the L1 and L2 caches becomes the main source of traffic on the GPU's on-chip interconnection network. As the number of CUs increases in each future generation of GPU systems, latency in the on-chip interconnection network becomes a major performance bottleneck on the GPU [6].…”

Section: Introductionmentioning

confidence: 99%

Leveraging Silicon-Photonic NoC for Designing Scalable GPUs

Ziabari

Abellán

Ubal

et al. 2015

Proceedings of the 29th ACM on International Conference on Supercomputing

View full text Add to dashboard Cite

Silicon-photonic link technology promises to satisfy the growing need for high bandwidth, low-latency and energy-efficient network-on-chip (NoC) architectures. While silicon-photonic NoC designs have been extensively studied for future manycore systems, their use in massively-threaded GPUs has received little attention to date. In this paper, we first analyze an electrical NoC which connects different cache levels (L1 to L2) in a contemporary GPU memory hierarchy. Evaluating workloads from the AMD SDK run on the Multi2sim GPU simulator finds that, apart from limits in memory bandwidth, an electrical NoC can significantly hamper performance and impede scalability, especially as the number of compute units grows in future GPU systems.To address this issue, we advocate using silicon-photonic link technology for on-chip communication in GPUs, and we present the first GPU-specific analysis of a cost-effective hybrid photonic crossbar NoC. Our baseline is based on an AMD Southern Islands GPU with 32 compute units (CUs) and we compare this design to our proposed hybrid siliconphotonic NoC. Our proposed photonic hybrid NoC increases performance by up to 6× (2.7× on average) and reduces the energy-delay 2 product (ED 2 P) by up to 99% (79% on average) as compared to conventional electrical crossbars. For future GPU systems, we study an electrical 2D-mesh topology since it scales better than an electrical crossbar. For a 128-CU GPU, the proposed hybrid silicon-photonic NoC can improve performance by up to 1.9× (43% on average) and achieve up to 62% reduction in ED 2 P (3% on average) in comparison to mesh design with best performance.

show abstract

Adaptive Cache Management for Energy-Efficient GPU Computing

Cited by 146 publications

References 36 publications

Enhancing GPU Performance by Efficient Hardware-Based and Hybrid L1 Data Cache Bypassing

Enhancing GPU Performance by Efficient Hardware-Based and Hybrid L1 Data Cache Bypassing

Locality-Driven Dynamic GPU Cache Bypassing

Leveraging Silicon-Photonic NoC for Designing Scalable GPUs

Contact Info

Product

Resources

About