2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2015.7056024
Priority-based cache allocation in throughput processors

Abstract: GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can, however, increase contention for various system resources, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, however, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth. …

Cited by 71 publications (37 citation statements)
References 18 publications
“…Besides, we enhance the baseline L1D and L2 caches with a XOR-based set index hashing technique [26], making it close to the real GPU device's configuration. Subsequently, we implement seven different warp schedulers: (1) GTO (GTO scheduler with set-index hashing [26]); (2) CCWS; (3) Best-SWL (best static wavefront limiting); (4) statPCAL (representative implementation of bypass scheme [27] that performs similar or better than [6], [28]); (5) CIAO-P (CIAO with only redirecting memory requests of interfering warp to shared memory); (6) CIAO-T (CIAO with only selective warp throttling); and (7) CIAO-C (CIAO with both CIAO-T and CIAO-P). Note that CCWS, Best-SWL, and CIAO-P/T/C leverage GTO to decide the order of execution of warps.…”
Section: A. Methodology
confidence: 99%
“…There are many prior studies oriented towards eliminating GPU L1D cache thrashing. For example, [13], [47], [48], [49], [50] propose to bypass part of data requests to the off-chip memory in an attempt to protect data blocks in L1D cache from early eviction. However, these bypass strategies can degrade the overall performance and make energy efficiency worse, since they typically introduce more off-chip access overheads.…”
Section: Related Work
confidence: 99%
“…In the worst case, due to the lack of L2 cache capacity, it is sometimes necessary to load the evicted data from the off-chip memory. 6,31,[33][34][35][36][37][38][39][40][41] Shared memory is an alternative to the L1 cache for storing preloaded data. There are several reasons to support this.…”
Section: Preloading in the Shared Memory
confidence: 99%
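The excerpt above describes using shared memory, rather than the L1 cache, as an explicitly managed store for preloaded data. A minimal CUDA sketch of that idea follows; the kernel, array names, and tile size are illustrative assumptions, not taken from the cited paper:

```cuda
#include <cuda_runtime.h>

#define TILE 256  // illustrative tile size, one element per thread

// Hypothetical kernel: each block stages a tile of read-only input into
// shared memory with a single coalesced load, so later reuse is served
// from software-managed on-chip storage instead of competing for L1.
__global__ void scaled_copy(const float *in, float *out, int n, float alpha) {
    __shared__ float tile[TILE];          // explicitly managed on-chip buffer

    int idx = blockIdx.x * TILE + threadIdx.x;

    if (idx < n)
        tile[threadIdx.x] = in[idx];      // preload: global -> shared
    __syncthreads();                      // tile now resident for the whole block

    if (idx < n)
        out[idx] = alpha * tile[threadIdx.x];  // reuse hits shared memory, not L1
}
```

Because shared memory is allocated per block and never evicted, data placed there is immune to the L1 thrashing the citing papers discuss, at the cost of explicit staging and synchronization.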
“…As many previous research studies have shown, effectively hiding cache resource contention is a crucial step to achieving high performance on GPUs. 6,31,[33][34][35][36][37][38][39][40][41]43 Previous studies of resolving the resource contention problems are based on dynamic analysis methods that require hardware modification. In addition to preloading in shared memory efficiently, it is necessary to combine static analysis to avoid the L1 cache from the resource contentions effectively.…”
Section: Impact of Various Preload Factors
confidence: 99%