Characterizing and improving the use of demand-fetched caches in GPUs

Jia, Wenhao; Shaw, Kelly A.; Martonosi, Margaret

doi:10.1145/2304576.2304582

Cited by 128 publications

(86 citation statements)

References 20 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…As pointed out in previous literature [10,36], many GPU applications do not cache well and suffer from high cache miss rates and low block reuse. Such low caching efficiency occurs both due to streaming data accesses and also because the threads contend for cache resources and constrain the effective on-chip storage available to each thread.…”

Section: Pitfalls Of Spp-based Gdu In Gpusmentioning

confidence: 94%

“…Recent literature [36,10] shows that throughput processors make poor use of data caches, due to the high cache access intensity and the resulting low per-thread cache capacity. To this end, previous work makes the warp scheduler cache-conscious [10] such that the number of warps able to access the cache are dynamically reduced (hence throttling thread-level parallelism [TLP] available at the SM) if the cache is thrashing.…”

Section: Thread-level Parallelismmentioning

confidence: 99%

See 1 more Smart Citation

A locality-aware memory hierarchy for energy-efficient GPU architectures

Rhu

Sullivan

Leng

et al. 2013

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

103

View full text Add to dashboard Cite

As GPU's compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache meta-data storage. These coarse-grained memory accesses, however, are a poor match for emerging GPU applications with irregular control flow and memory access patterns. Meanwhile, the massive multi-threading of GPUs and the simplicity of their cache hierarchies make CPU-specific memory system enhancements ineffective for improving the performance of irregular GPU applications. We design and evaluate a locality-aware memory hierarchy for throughput processors, such as GPUs. Our proposed design retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality. As such, our locality-aware memory hierarchy improves GPU performance, energy-efficiency, and memory throughput for a large range of applications.

show abstract

Section: Pitfalls Of Spp-based Gdu In Gpusmentioning

confidence: 94%

Section: Thread-level Parallelismmentioning

confidence: 99%

A locality-aware memory hierarchy for energy-efficient GPU architectures

Rhu

Sullivan

Leng

et al. 2013

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

103

View full text Add to dashboard Cite

show abstract

“…It is even more important for APUs to include contention-resistant or congestion-resistant techniques because their GPU private caches are relatively smaller than those of discrete cards. Compiler or programming techniques for improving GPU cache performance are also investigated [18], [44], [6]. Although static compiler-directed bypassing could be effective for regular applications, we provide a hardware dynamic solution that adapts to different runtime behaviors.…”

Section: B Gpu Cache Managementmentioning

confidence: 99%

“…Typical CPU cache architecture is optimized for memory latency, which does not necessarily benefit throughputoriented processors like GPUs, because massive multithreading makes cache locality difficult to capture [18], [10], [38]. Like CPU caches, GPU caches performance is hampered by thrashing, particularly inter-warp contention [39], [19], which is much more common in GPUs due to massive multithreading.…”

Section: Introductionmentioning

confidence: 99%

Adaptive Cache Management for Energy-Efficient GPU Computing

Chen

Chang

Rodrigues

et al. 2014

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture

146

View full text Add to dashboard Cite

Abstract-With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch of the throughput-oriented execution model and its cache hierarchy design, which limits system performance and energy-efficiency.The massive amount of memory requests generated by GPUs cause cache contention and resource congestion. Existing CPU cache management policies that are designed for multicore systems, can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by the bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the active number of warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cachesensitive benchmarks. This results in a harmonic mean IPC improvement of 74% and 17% (maximum 661% and 44% IPC improvement), compared to the baseline GPU architecture and optimal static warp throttling, respectively.

show abstract

“…The GPGPU L2 cache, typically with a size of 768KByte [10], is relative small when compared to the RF in a GPGPU. The huge amount of concurrent threads and zero-cost context switching request the RF to be as large as possible and the demand is still growing in future GPGPUs.…”

Section: Related Workmentioning

confidence: 99%

An energy-efficient and scalable eDRAM-based register file architecture for GPGPU

JingNaifeng

ShenYao

LuYao

et al. 2013

SIGARCH Comput. Archit. News

View full text Add to dashboard Cite

The heavily-threaded data processing demands of streaming multiprocessors (SM) in a GPGPU require a large register file (RF). The fast increasing size of the RF makes the area cost and power consumption unaffordable for traditional S-RAM designs in the future technologies. In this paper, we propose to use embedded-DRAM (eDRAM) as an alternative in future GPGPUs. Compared with SRAM, eDRAM provides higher density and lower leakage power. However, the limited data retention time in eDRAM poses new challenges. Periodic refresh operations are needed to maintain data integrity. This is exacerbated with the scaling of eDRAM density, process variations and temperature. Unlike conventional CPUs which make use of multi-ported RF, most of the RFs in modern GPGPU are heavily banked but not multi-ported to reduce the hardware cost. This provides a unique opportunity to hide the refresh overhead. We propose two different eDRAM implementations based on 3T1D and 1T1C memory cells. To mitigate the impact of periodic refresh, we propose two novel refresh solutions using bank bubble and bank walk-through. Plus, for the 1T1C RF, we design an interleaved bank organization together with an intelligent warp scheduling strategy to reduce the impact of the destructive reads. The analysis shows that our schemes present better energy efficiency, scalability and variation tolerance than traditional SRAM-based designs.

show abstract

Characterizing and improving the use of demand-fetched caches in GPUs

Cited by 128 publications

References 20 publications

A locality-aware memory hierarchy for energy-efficient GPU architectures

A locality-aware memory hierarchy for energy-efficient GPU architectures

Adaptive Cache Management for Energy-Efficient GPU Computing

An energy-efficient and scalable eDRAM-based register file architecture for GPGPU

Contact Info

Product

Resources

About