Proceedings of the 8th Workshop on General Purpose Processing Using GPUs 2015
DOI: 10.1145/2716282.2716291
Efficient utilization of GPGPU cache hierarchy

Abstract: Recent GPUs are equipped with general-purpose L1 and L2 caches in an attempt to reduce memory bandwidth demand and improve the performance of some irregular GPGPU applications. However, due to the massive multithreading, GPGPU caches suffer from severe resource contention and low data-sharing, which may degrade the performance instead. In this work, we propose three techniques to efficiently utilize and improve the performance of GPGPU caches. The first technique aims to dynamically detect and bypass memory acce…
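The first of these techniques, dynamic bypass detection, is a hardware mechanism, but its effect can be mimicked statically in CUDA. The sketch below is a minimal illustration only, assuming a device of compute capability 3.2 or later for the __ldcg() cache-hint intrinsic; it is not the paper's dynamic mechanism, and the kernel and sizes are made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __ldcg() loads with the .cg hint: cache in L2 only, bypassing L1.
// Streaming data that is touched once therefore cannot evict L1 lines
// that other accesses might reuse -- the effect the paper's dynamic
// detector aims to achieve automatically.
__global__ void scale(int n, float a, const float* __restrict__ x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * __ldcg(&x[i]);  // x is streamed once: bypass L1
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 2.0f;

    scale<<<(n + 255) / 256, 256>>>(n, 0.5f, x, y);
    cudaDeviceSynchronize();
    printf("y[0] = %f (expected 1.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

For a whole-kernel version of the same bypass, nvcc's -Xptxas -dlcm=cg flag compiles all global loads with the .cg policy.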

Cited by 24 publications (9 citation statements) · References 35 publications (87 reference statements)
“…For instance, there can be thousands of threads competing for a small 16-KB L1D cache in the Fermi architecture. Cache thrashing is likely to hurt the performance, as observed in many previous publications, and the problem tends to be exacerbated by the increasing number of concurrent threads in future GPGPUs.…”
Section: Introduction
confidence: 86%
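To make the contention concrete, a back-of-the-envelope calculation (using Fermi's documented limits of 1536 resident threads per SM and 128-byte L1 cache lines) shows how little cache each thread can claim:

\[
\frac{16\ \mathrm{KB}}{1536\ \mathrm{threads}} = \frac{16384\ \mathrm{B}}{1536} \approx 10.7\ \mathrm{B\ per\ thread},
\]

i.e., less than a tenth of a single 128-byte cache line per thread, so any per-thread working set larger than a few bytes forces evictions.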
“…However, as the number of threads in GPGPUs is much larger than in CPUs, the small cache capacity available in modern GPGPUs is far from sufficient, which causes the first-level data (L1D) cache to swap useful cache lines in and out frequently; this cache thrashing is likely to hurt the performance, as observed in many previous publications, [6][7][8] and the problem tends to be exacerbated by the increasing number of concurrent threads in future GPGPUs.…”
[Figure 1: Baseline architecture and on-chip memory hierarchy. L1D, first-level data; PC, program counter; SIMT, single instruction multiple threads; SIMD, single instruction multiple data; MC, memory controller; L2$, second-level cache.]
Section: Introduction
confidence: 99%
“…Mahmoud Khairy, Mohamed Zahran, and Amr G. Wassal [6] have experimented with three techniques for utilizing the GPGPU cache efficiently. They are …”
Section: Related Work
confidence: 99%
“…This information is tabulated in Table I. The cache memory information about the GPGPU of the experimental setup was gathered from a CUDA programming book written by S. Cook [5] and the manual of the Fermi architecture [6]. The details of the GPGPU cache architecture are given in Table II.…”
Section: Setups
confidence: 99%
“…Such warp scheduling, however, is not efficient for memory-intensive applications in which active warps collectively generate too many memory requests and thus contend for limited cache space [2], [3]. Prior work reports that such cache contention (or interference) frequently incurs cache thrashing and therefore severely degrades performance [4], [5], [6]. For example, our own experiment shows that a GPU can improve the geometric-mean performance of popular benchmark suites such as PolyBench [7], Mars [8] and …”
Section: Introduction
confidence: 99%
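The remedy implied by this statement, throttling the number of concurrently scheduled warps, can be approximated in software. The CUDA sketch below is illustrative only: the grid-stride saxpy kernel, the 2-blocks-per-SM knob, and all names are assumptions for the example, not the mechanism proposed in the cited work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel: correctness is independent of grid size, so the
// host can launch fewer blocks to throttle how many warps are resident
// and contending for L1/L2 cache space at any one time.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Hypothetical throttle: cap the grid at 2 blocks per SM instead of
    // saturating the machine, trading parallelism for less contention.
    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    const int blocksPerSM = 2;  // tuning knob (assumption, not from the paper)
    saxpy<<<numSMs * blocksPerSM, 256>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f (expected 5.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Whether throttling helps depends on whether the kernel is cache-sensitive; the cited schemes make that decision dynamically rather than with a fixed knob.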