2017
DOI: 10.1145/3140659.3080239

Access Pattern-Aware Cache Management for Improving Data Utilization in GPU

Abstract: The long latency of memory operations is a prominent performance bottleneck in graphics processing units (GPUs). The small data cache that must be shared across dozens of warps (collections of threads) creates significant cache contention and premature data eviction. Prior works have recognized this problem and proposed warp throttling, which reduces the number of active warps contending for cache space. In this paper we discover that individual load instructions in a warp exhibit four different types of data local…
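The mechanism the abstract describes, classifying each static load instruction by its observed locality and then managing the cache per load, can be sketched in a few lines of C++. This is a hedged reconstruction, not APCM's actual algorithm: the excerpt is truncated before it names the four locality types, so the LocalityType categories, the LoadStats fields, the classify_load thresholds, and every identifier below are illustrative assumptions.

    // Hypothetical sketch: per-load-instruction locality classification and the
    // cache action it drives. All names and thresholds are assumptions for
    // illustration; the truncated abstract does not list the four actual types.
    #include <cstdint>

    enum class LocalityType { Streaming, IntraWarp, InterWarp, Mixed };
    enum class CacheAction  { Bypass, Protect, Default };

    struct LoadStats {
        uint64_t accesses = 0;  // dynamic accesses issued by this load PC
        uint64_t hits     = 0;  // L1 hits observed for this load PC
        uint64_t reuses   = 0;  // lines touched again before eviction
    };

    // Classify a static load by the hit/reuse behavior observed at runtime.
    LocalityType classify_load(const LoadStats& s) {
        if (s.accesses == 0) return LocalityType::Mixed;
        double hit_rate   = double(s.hits)   / s.accesses;
        double reuse_rate = double(s.reuses) / s.accesses;
        if (reuse_rate < 0.05) return LocalityType::Streaming; // no reuse: don't cache
        if (hit_rate   > 0.50) return LocalityType::IntraWarp; // reuse within a warp
        return LocalityType::InterWarp;                        // reuse across warps
    }

    // Map locality type to a cache-management action, in the spirit of
    // bypassing streaming loads and protecting high-locality ones.
    CacheAction action_for(LocalityType t) {
        switch (t) {
            case LocalityType::Streaming: return CacheAction::Bypass;
            case LocalityType::IntraWarp: return CacheAction::Protect;
            default:                      return CacheAction::Default;
        }
    }

    int main() {
        LoadStats streaming_ld{1000, 20, 10}; // almost no reuse -> Bypass
        LoadStats hot_ld{1000, 800, 700};     // high hit rate   -> Protect
        return action_for(classify_load(streaming_ld)) == CacheAction::Bypass &&
               action_for(classify_load(hot_ld))       == CacheAction::Protect
                   ? 0 : 1;
    }

A hardware implementation would keep such statistics per load PC in a small table and consult it on each L1 access; the thresholds here exist only to make the sketch executable.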

Cited by 20 publications (16 citation statements)
References 32 publications
Citation types: 0 supporting, 16 mentioning, 0 contrasting
Citing publications span 2018 to 2022
“…Cache bypassing: Cache bypassing schemes also aim to improve memory system performance in GPUs. Therefore, we evaluate Poise against APCM [28], the state-of-the-art scheme for bypassing and protecting cache lines on the basis of instruction locality. APCM achieves this by filtering streaming accesses from high-locality accesses.…”
Section: J Discussion (mentioning)
Confidence: 99%
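The "bypass and protect" behavior described in the excerpt above has two halves; the protection half can be sketched as a victim-selection policy that skips protected lines. The set layout, the protected bit, and select_victim below are illustrative assumptions, not APCM's documented microarchitecture.

    // Minimal sketch of "protecting" cache lines installed by high-locality
    // loads: a protected line is skipped during victim selection.
    #include <array>
    #include <cstdint>

    struct CacheLine {
        uint64_t tag        = 0;
        bool     valid      = false;
        bool     protected_ = false; // set when installed by a high-locality load
        uint32_t lru_age    = 0;     // larger = older
    };

    constexpr int kWays = 4;
    using CacheSet = std::array<CacheLine, kWays>;

    // Pick a victim way, preferring unprotected lines; fall back to the oldest
    // line overall so the set can never deadlock when everything is protected.
    int select_victim(const CacheSet& set) {
        int victim = 0, oldest = 0;
        bool found_unprotected = false;
        for (int w = 0; w < kWays; ++w) {
            if (!set[w].valid) return w;  // free way: use it
            if (set[w].lru_age >= set[oldest].lru_age) oldest = w;
            if (!set[w].protected_ &&
                (!found_unprotected || set[w].lru_age > set[victim].lru_age)) {
                victim = w;
                found_unprotected = true;
            }
        }
        return found_unprotected ? victim : oldest;
    }

    int main() {
        CacheSet set{};
        set[0] = {0xA, true, true, 5};   // protected, oldest in the set
        set[1] = {0xB, true, false, 3};  // oldest unprotected line
        set[2] = {0xC, true, false, 1};
        set[3] = {0xD, true, true, 4};
        // Victim should be way 1: the oldest *unprotected* line.
        return select_victim(set) == 1 ? 0 : 1;
    }

Falling back to plain LRU when every way is protected keeps protection from starving the set, a safeguard any real design needs; whether APCM handles this case the same way is not established by the excerpt.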
“…Therefore, it suffers from the same limitations as PCAL that were discussed previously in Section III-C. More recently, Lee and Wu [32] proposed an instruction-based scheme to bypass requests from low-reuse memory instructions. Similarly, Koo et al. [28] proposed APCM, an instruction-based scheme to not only bypass, but also to protect, cache lines using instruction locality characteristics (discussed in Section VII-J). Furthermore, Jia et al. [24] presented a taxonomy for memory access locality and proposed a compile-time algorithm to selectively utilize the L1 caches for different locality types.…”
Section: Related Work (mentioning)
Confidence: 99%
“…While the goal of these schedulers is to improve cache performance, our approach 1) is not dependent on any scheduling algorithm, 2) does not require any software support to determine private and shared data, and 3) not only reduces replication but can eliminate it. In general, prior L1 cache capacity management works based on bypassing [34,62], sectoring [53], or compression [4] do not ensure zero data replication across L1s. However, they can continue to improve the performance of local L1 caches, while our shared L1 organization can facilitate coordination across L1s for their better utilization.…”
Section: Related Work (mentioning)
Confidence: 99%
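The zero-replication property claimed for the shared L1 organization in the excerpt above follows from static address interleaving: if every line address has exactly one home L1 slice, no two L1s can ever hold copies of the same line. The sketch below illustrates that mapping; the slice count, line size, and home_l1 function are assumptions for illustration, not the cited design.

    // Illustrative sketch of why a shared L1 organization eliminates
    // replication: each cache line address is statically mapped to exactly
    // one L1 slice, so no two L1s can cache the same line.
    #include <cstdint>

    constexpr uint32_t kNumL1Slices = 16;   // e.g., one slice per SM (assumed)
    constexpr uint32_t kLineBytes   = 128;  // typical GPU cache line size

    // Home slice for a given address: line-granularity interleaving.
    uint32_t home_l1(uint64_t addr) {
        return (addr / kLineBytes) % kNumL1Slices;
    }

    int main() {
        // Two addresses in the same 128-byte line share one home slice, so at
        // most one L1 copy of that line can ever exist.
        return home_l1(0x1000) == home_l1(0x1040) ? 0 : 1;
    }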
“…Works in this section fall into two categories. The first aims to increase data reuse at the cache level using various cache management policies (e.g., bypassing [74], buffering [8], and pinning [31]). The Locality Descriptor [65] is primarily designed to convey locality semantics to leverage cache and NUMA locality in GPUs.…”
Section: Related Work (mentioning)
Confidence: 99%