Proceedings of the 8th Workshop on General Purpose Processing Using GPUs 2015
DOI: 10.1145/2716282.2716291
Efficient utilization of GPGPU cache hierarchy

Abstract: Recent GPUs are equipped with general-purpose L1 and L2 caches in an attempt to reduce memory bandwidth demand and improve the performance of some irregular GPGPU applications. However, due to the massive multithreading, GPGPU caches suffer from severe resource contention and low data-sharing, which may degrade the performance instead. In this work, we propose three techniques to efficiently utilize and improve the performance of GPGPU caches. The first technique aims to dynamically detect and bypass memory acce…
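The first of these techniques, dynamic bypass detection, is a hardware mechanism, but its effect can be mimicked statically in CUDA. The sketch below is a minimal illustration only, assuming a device of compute capability 3.2 or later for the __ldcg() cache-hint intrinsic; it is not the paper's dynamic mechanism, and the kernel and sizes are made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __ldcg() loads with the .cg hint: cache in L2 only, bypassing L1.
// Streaming data that is touched once therefore cannot evict L1 lines
// that other accesses might reuse -- the effect the paper's dynamic
// detector aims to achieve automatically.
__global__ void scale(int n, float a, const float* __restrict__ x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * __ldcg(&x[i]);  // x is streamed once: bypass L1
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 2.0f;

    scale<<<(n + 255) / 256, 256>>>(n, 0.5f, x, y);
    cudaDeviceSynchronize();
    printf("y[0] = %f (expected 1.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

For a whole-kernel version of the same bypass, nvcc's -Xptxas -dlcm=cg flag compiles all global loads with the .cg policy.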

Cited by 24 publications (9 citation statements) · References 35 publications (87 reference statements)
“…For instance, there can be thousands of threads competing for a small 16-KB L1D cache in the Fermi architecture. Cache thrashing is likely to hurt the performance, as observed in many previous publications, and the problem tends to be exacerbated by the increasing number of concurrent threads in future GPGPUs.…”
Section: Introduction
confidence: 86%
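To make the contention concrete, a back-of-the-envelope calculation (using Fermi's documented limits of 1536 resident threads per SM and 128-byte L1 cache lines) shows how little cache each thread can claim:

\[
\frac{16\ \mathrm{KB}}{1536\ \mathrm{threads}} = \frac{16384\ \mathrm{B}}{1536} \approx 10.7\ \mathrm{B\ per\ thread},
\]

i.e., less than a tenth of a single 128-byte cache line per thread, so any per-thread working set larger than a few bytes forces evictions.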
“…However, as the number of threads in GPGPUs is much larger than in CPUs, the small cache capacity available in modern GPGPUs is far from sufficient, which causes the first-level data (L1D) cache to swap useful cache lines in and out frequently; this cache thrashing is likely to hurt the performance, as observed in many previous publications, [6][7][8] and the problem tends to be exacerbated by the increasing number of concurrent threads in future GPGPUs.…”
[Figure 1: Baseline architecture and on-chip memory hierarchy. L1D, first-level data; PC, program counter; SIMT, single instruction multiple threads; SIMD, single instruction multiple data; MC, memory controller; L2$, second-level cache.]
Section: Introduction
confidence: 99%
“…Mahmoud Khairy, Mohamed Zahran, and Amr G. Wassal [6] have experimented with three techniques for utilizing the GPGPU cache efficiently. They are …”
Section: Related Work
confidence: 99%
“…This information is tabulated in Table I. The cache memory information about the GPGPU of the experimental setup was gathered from a CUDA programming book written by S. Cook [5] and the manual of the Fermi architecture [6]. The details of the GPGPU cache architecture are given in Table II.…”
Section: Setups
confidence: 99%
“…Such warp scheduling, however, is not efficient for memory-intensive applications in which active warps collectively generate too many memory requests and thus contend for limited cache space [2], [3]. Prior work reports that such cache contention (or interference) frequently incurs cache thrashing and therefore severely degrades performance [4], [5], [6]. For example, our own experiment shows that a GPU can improve the geometric-mean performance of popular benchmark suites such as PolyBench [7], Mars [8] and …”
Section: Introduction
confidence: 99%
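The remedy implied by this statement, throttling the number of concurrently scheduled warps, can be approximated in software. The CUDA sketch below is illustrative only: the grid-stride saxpy kernel, the 2-blocks-per-SM knob, and all names are assumptions for the example, not the mechanism proposed in the cited work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel: correctness is independent of grid size, so the
// host can launch fewer blocks to throttle how many warps are resident
// and contending for L1/L2 cache space at any one time.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Hypothetical throttle: cap the grid at 2 blocks per SM instead of
    // saturating the machine, trading parallelism for less contention.
    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    const int blocksPerSM = 2;  // tuning knob (assumption, not from the paper)
    saxpy<<<numSMs * blocksPerSM, 256>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f (expected 5.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Whether throttling helps depends on whether the kernel is cache-sensitive; the cited schemes make that decision dynamically rather than with a fixed knob.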