Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor

Gebhart, Mark; Keckler, Stephen W.; Khailany, Brucek; Krashinsky, Ronny; Dally, William J.

doi:10.1109/micro.2012.18

Cited by 94 publications

(61 citation statements)

References 20 publications

(24 reference statements)


“…Our work mainly considers L1 cache and our bypass policy is based on reuse distance prediction. A unified GPU on-chip memory design is proposed by Gebhart et al [14] to satisfy varying capacity needs across different applications. LLC management policies for 3D scene rendering workloads on GPUs are explored by Gaur et al [13], while our work focuses on general purpose applications.…”

Section: B. GPU Cache Management (mentioning)

confidence: 99%

Adaptive Cache Management for Energy-Efficient GPU Computing

Chen, Chang, Rodrigues, et al. 2014

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture

146


Abstract: With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch between the throughput-oriented execution model and its cache hierarchy design, which limits system performance and energy efficiency. The massive number of memory requests generated by GPUs causes cache contention and resource congestion. Existing CPU cache management policies that are designed for multicore systems can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by the bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the active number of warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache-sensitive benchmarks. This results in a harmonic mean IPC improvement of 74% and 17% (maximum 661% and 44% IPC improvement), compared to the baseline GPU architecture and optimal static warp throttling, respectively.
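The abstract above names a bypass policy based on reuse distance but does not spell out the mechanism. As a rough illustration only, the sketch below measures reuse distance over an access trace and bypasses lines whose distance would not fit in the cache. The function names, the fully associative LRU assumption, and the choice to insert lines with unknown distance are all assumptions made for this example, not the predictor described by Chen et al.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """For each access, count the distinct lines touched since the previous
    access to the same line. A first touch has no distance (None)."""
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for line in trace:
        if line in stack:
            keys = list(stack.keys())
            # Lines positioned after `line` were touched more recently.
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(None)
            stack[line] = True
    return distances


def should_bypass(predicted_distance, cache_lines):
    """Hypothetical rule: bypass when the predicted reuse distance is known
    and is at least the cache capacity in lines, because a fully associative
    LRU cache of that size would evict the line before it is reused."""
    return predicted_distance is not None and predicted_distance >= cache_lines


if __name__ == "__main__":
    trace = ["A", "B", "C", "A", "B", "B"]
    for line, dist in zip(trace, reuse_distances(trace)):
        print(line, dist, should_bypass(dist, cache_lines=2))
```

A hardware predictor would estimate the distance before the reuse happens; here the trace is replayed offline purely to show what the quantity means.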

“…TSIMT register files use the same basic design idea as the register files of conventional GPUs: Instead of using costly multiported memories, multiple single ported SRAM banks are used [Lindholm et al 2008b;Gebhart et al 2012]. These register banks are connected using a crossbar to an operand collector.…”

Section: Register File (mentioning)

confidence: 99%

Spatiotemporal SIMT and Scalarization for Improving GPU Efficiency

Lucas, Andersch, Alvarez-Mesa, et al. 2015

ACM Trans. Archit. Code Optim.


Temporal SIMT (TSIMT) has been suggested as an alternative to conventional (spatial) SIMT for improving GPU performance on branch-intensive code. Although TSIMT has been briefly mentioned before, it was not evaluated. Therefore, we present a complete design and evaluation of TSIMT GPUs, along with the inclusion of scalarization and a combination of temporal and spatial SIMT, named Spatio-Temporal SIMT (STSIMT). Simulations show that TSIMT alone results in a performance reduction, but a combination of scalarization and STSIMT yields a mean performance enhancement of 19.6% and improves the energy-delay product by 26.2% compared to SIMT.
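The citation statement for this entry describes register files built from multiple single-ported SRAM banks feeding an operand collector. A minimal sketch of why that design can stall on operand reads follows. The modulo register-to-bank mapping, the one-read-per-bank-per-cycle limit, and the function name are assumptions chosen for illustration; the source does not specify them.

```python
from collections import Counter


def operand_read_cycles(source_regs, num_banks):
    """Estimate the cycles needed to gather an instruction's source operands
    when each single-ported bank can serve one read per cycle.

    Assumed mapping (hypothetical): register r lives in bank r % num_banks.
    Reading the same register twice is counted as a single read.
    """
    if not source_regs:
        return 0
    reads_per_bank = Counter(reg % num_banks for reg in set(source_regs))
    # The most heavily loaded bank determines how long collection takes.
    return max(reads_per_bank.values())


if __name__ == "__main__":
    print(operand_read_cycles([1, 2, 3], num_banks=4))  # operands in distinct banks
    print(operand_read_cycles([0, 4, 8], num_banks=4))  # all operands in bank 0
```

Under these assumptions, operands spread across banks are collected in one cycle, while operands that collide in one bank are serialized.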


“…We implement a malleable memory system proposed by Gebhart et al that allows flexible use of on-chip SRAM to optimize energy efficiency [25]. Rather than having a fixed pool of registers per thread and cache-capacity per thread or compute cluster, malleable memory allows the compiler to identify and expose the number of registers that will be needed for any given kernel execution.…”

Section: B. Throughput Optimized Core Architecture (mentioning)

confidence: 99%

Scaling the Power Wall: A Path to Exascale

Villa, Johnson, O'Connor, et al. 2014

SC14: International Conference for High Performance Computing, Networking, Storage and Analysis

Self Cite


Abstract: Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020 and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide over a 20× improvement in energy efficiency. In this paper, we present some of the progress NVIDIA Research is making toward the design of Exascale systems by tailoring features to address the scaling challenges of performance and energy efficiency. We evaluate several architectural concepts for a set of HPC applications demonstrating expected energy efficiency improvements resulting from circuit and packaging innovations such as low-voltage SRAM, low-energy signaling, and on-package memory. Finally, we discuss the scaling of these features with respect to future process technologies and provide power and performance projections for our Exascale research architecture.
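The citation statement for this entry describes malleable memory, where the register count a kernel needs determines how a shared pool of on-chip SRAM is divided. The sketch below shows one way such a per-kernel split could be computed. The function name, the 4-byte register width, the example capacities, and the rule that all leftover capacity becomes cache are assumptions for illustration, not details taken from Gebhart et al. or Villa et al.

```python
def partition_unified_sram(total_bytes, regs_per_thread, threads,
                           scratch_bytes, bytes_per_reg=4):
    """Split one pool of on-chip SRAM among register file, scratch, and cache.

    Hypothetical policy: registers and scratch get exactly what the kernel
    requests, and whatever remains is used as cache.
    """
    register_bytes = regs_per_thread * threads * bytes_per_reg
    if register_bytes + scratch_bytes > total_bytes:
        raise ValueError("kernel demands exceed unified SRAM capacity")
    return {
        "register_file": register_bytes,
        "scratch": scratch_bytes,
        "cache": total_bytes - register_bytes - scratch_bytes,
    }


if __name__ == "__main__":
    # Illustrative numbers only: 384 KiB pool, 1024 threads, 32 registers each.
    split = partition_unified_sram(
        total_bytes=384 * 1024,
        regs_per_thread=32,
        threads=1024,
        scratch_bytes=48 * 1024,
    )
    print(split)
```

A kernel that needs fewer registers per thread leaves more of the pool for cache under this rule, which is the flexibility the quoted passage points to.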

