Proceedings of the 26th ACM International Conference on Supercomputing 2012
DOI: 10.1145/2304576.2304582
Characterizing and improving the use of demand-fetched caches in GPUs

Abstract: Initially introduced as special-purpose accelerators for games and graphics code, graphics processing units (GPUs) have emerged as widely-used high-performance parallel computing platforms. GPUs traditionally provided only software-managed local memories (or scratchpads) instead of demand-fetched caches. Increasingly, however, GPUs are being used in broader application domains where memory access patterns are both harder to analyze and harder to manage in software-controlled caches. In response, GPU vendors have…

Cited by 128 publications (86 citation statements)
References 20 publications

“…As pointed out in previous literature [10,36], many GPU applications do not cache well, suffering from high cache miss rates and low block reuse. Such low caching efficiency arises both from streaming data accesses and from threads contending for cache resources, which constrains the effective on-chip storage available to each thread.…”
Section: Pitfalls of SPP-based GDU in GPUs
confidence: 94%
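To make the capacity pressure in that quotation concrete, here is a back-of-the-envelope calculation of the effective L1 capacity per thread. The 16 KB L1, 1536 resident threads, and 128-byte line size are assumed Fermi-era figures, not values taken from the cited papers.

```cpp
#include <cstdio>

// Back-of-the-envelope illustration of why per-thread cache capacity
// collapses under massive multithreading. The 16 KB L1, 1536 resident
// threads, and 128-byte line are assumed Fermi-era figures, not values
// from the cited papers.
int main() {
    const double l1_bytes = 16.0 * 1024;   // assumed per-SM L1 size
    const double resident_threads = 1536;  // assumed max threads per SM
    const double line_bytes = 128;         // typical GPU cache-line size

    printf("effective L1 per thread: %.1f bytes (vs. a %.0f-byte line)\n",
           l1_bytes / resident_threads, line_bytes);
    // ~10.7 bytes per thread: a thread cannot keep even one cache line
    // resident once all warps contend for the L1, which is the
    // contention effect the quotation describes.
    return 0;
}
```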
“…Recent literature [36,10] shows that throughput processors make poor use of data caches, owing to high cache access intensity and the resulting low per-thread cache capacity. To this end, previous work makes the warp scheduler cache-conscious [10], dynamically reducing the number of warps allowed to access the cache (thereby throttling the thread-level parallelism [TLP] available at the SM) when the cache is thrashing.…”
Section: Thread-Level Parallelism
confidence: 99%
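The cache-conscious throttling idea quoted above can be sketched in a few lines. This is a minimal illustration assuming a miss-rate-triggered epoch scheme; the WarpThrottle name, the 48-warp SM, and the 0.7/0.4 thresholds are hypothetical, not the mechanism of [10].

```cpp
#include <algorithm>
#include <cstdio>

// Minimal sketch of cache-conscious warp throttling: when the L1
// thrashes, shrink the set of warps allowed to issue memory requests,
// trading thread-level parallelism (TLP) for cache locality. Names,
// thresholds, and the epoch scheme are hypothetical.
struct WarpThrottle {
    int max_warps;     // warps the SM could schedule
    int active_warps;  // warps currently allowed to access the cache
    long hits = 0, misses = 0;

    explicit WarpThrottle(int n) : max_warps(n), active_warps(n) {}

    void record_access(bool hit) { hit ? ++hits : ++misses; }

    // Called once per sampling epoch to grow or shrink active TLP.
    void adjust() {
        long total = hits + misses;
        if (total == 0) return;
        double miss_rate = static_cast<double>(misses) / total;
        if (miss_rate > 0.7)        // assumed thrashing threshold
            active_warps = std::max(1, active_warps - 1);
        else if (miss_rate < 0.4)   // assumed recovery threshold
            active_warps = std::min(max_warps, active_warps + 1);
        hits = misses = 0;          // start a fresh epoch
    }
};

int main() {
    WarpThrottle t(48);  // assumed 48 schedulable warps per SM
    for (int i = 0; i < 100; ++i) t.record_access(i % 10 < 2);  // 80% misses
    t.adjust();
    printf("active warps after a thrashing epoch: %d of %d\n",
           t.active_warps, t.max_warps);
    return 0;
}
```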
“…It is even more important for APUs to include contention-resistant or congestion-resistant techniques because their GPU private caches are smaller than those of discrete cards. Compiler and programming techniques for improving GPU cache performance have also been investigated [18], [44], [6]. Although static compiler-directed bypassing can be effective for regular applications, we provide a hardware dynamic solution that adapts to different runtime behaviors.…”
Section: B. GPU Cache Management
confidence: 99%
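For contrast with the static compiler-directed bypassing mentioned above, a hardware-style dynamic bypass decision might look like the following sketch. The PC-indexed saturating reuse counter, its 2-bit range, and the BypassPredictor name are assumptions for illustration, not the scheme of any cited paper.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Minimal sketch of a hardware-style dynamic bypass decision: a small
// PC-indexed saturating counter predicts whether a load's blocks see
// reuse; predicted-streaming fills bypass the L1 instead of evicting
// useful data. The table, thresholds, and names are assumptions.
class BypassPredictor {
    std::unordered_map<uint64_t, int> reuse_ctr;  // per-load-PC counter

public:
    // Bypass only loads we have already observed to be reuse-free.
    bool should_bypass(uint64_t load_pc) const {
        auto it = reuse_ctr.find(load_pc);
        return it != reuse_ctr.end() && it->second == 0;
    }

    // Train on each evicted block: was it touched again after the fill?
    void on_eviction(uint64_t load_pc, bool was_reused) {
        int& c = reuse_ctr[load_pc];
        c = was_reused ? std::min(c + 1, 3) : std::max(c - 1, 0);
    }
};

int main() {
    BypassPredictor p;
    p.on_eviction(0x400123, /*was_reused=*/false);  // streaming evidence
    printf("bypass loads from PC 0x400123? %s\n",
           p.should_bypass(0x400123) ? "yes" : "no");
    return 0;
}
```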
“…Typical CPU cache architecture is optimized for memory latency, which does not necessarily benefit throughput-oriented processors like GPUs, because massive multithreading makes cache locality difficult to capture [18], [10], [38]. Like CPU caches, GPU caches are hampered by thrashing, particularly inter-warp contention [39], [19], which is much more common in GPUs due to massive multithreading.…”
Section: Introduction
confidence: 99%
“…The GPGPU L2 cache, typically 768 KB in size [10], is relatively small compared to the register file (RF) in a GPGPU. The huge number of concurrent threads and zero-cost context switching require the RF to be as large as possible, and the demand is still growing in future GPGPUs.…”
Section: Related Work
confidence: 99%
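To put the quoted 768 KB L2 in perspective against the register file, here is a worked comparison using assumed Fermi-class figures (32K 32-bit registers per SM, 15 SMs); only the L2 size comes from the quotation above.

```cpp
#include <cstdio>

// Worked comparison of aggregate register-file capacity vs. the L2,
// using assumed Fermi-class figures (32K 32-bit registers per SM,
// 15 SMs); only the 768 KB L2 size is taken from the quoted text.
int main() {
    const double regs_per_sm = 32.0 * 1024;  // assumed registers per SM
    const double bytes_per_reg = 4;          // 32-bit registers
    const double num_sms = 15;               // assumed SM count
    const double l2_bytes = 768.0 * 1024;    // from the quotation

    double rf_total = regs_per_sm * bytes_per_reg * num_sms;
    printf("aggregate RF: %.0f KB vs. L2: %.0f KB (ratio %.1fx)\n",
           rf_total / 1024, l2_bytes / 1024, rf_total / l2_bytes);
    // ~1920 KB of registers against a 768 KB L2: the RF, not the
    // cache, dominates on-chip storage in this class of GPGPU.
    return 0;
}
```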