2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2015.7056024
Priority-based cache allocation in throughput processors

Abstract: GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can, however, increase contention for various system resources, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, however, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth. …

Cited by 71 publications (37 citation statements)
References 18 publications
“…Besides, we enhance the baseline L1D and L2 caches with a XOR-based set index hashing technique [26], making it close to the real GPU device's configuration. Subsequently, we implement seven different warp schedulers: (1) GTO (GTO scheduler with set-index hashing [26]); (2) CCWS; (3) Best-SWL (best static wavefront limiting); (4) statPCAL (representative implementation of bypass scheme [27] that performs similar or better than [6], [28]); (5) CIAO-P (CIAO with only redirecting memory requests of interfering warp to shared memory); (6) CIAO-T (CIAO with only selective warp throttling); and (7) CIAO-C (CIAO with both CIAO-T and CIAO-P). Note that CCWS, Best-SWL, and CIAO-P/T/C leverage GTO to decide the order of execution of warps.…”
Section: A. Methodology
confidence: 99%
“…There are many prior studies oriented towards eliminating GPU L1D cache thrashing. For example, [13], [47], [48], [49], [50] propose to bypass part of data requests to the off-chip memory in an attempt to protect data blocks in L1D cache from early eviction. However, these bypass strategies can degrade the overall performance and make energy efficiency worse, since they typically introduce more off-chip access overheads.…”
Section: Related Work
confidence: 99%
“…In the worst case, due to the lack of L2 cache capacity, it is sometimes necessary to load the evicted data from the off-chip memory. 6,31,[33][34][35][36][37][38][39][40][41] Shared memory is an alternative to the L1 cache for storing preloaded data. There are several reasons to support this.…”
Section: Preloading in the Shared Memory
confidence: 99%
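The excerpt above describes using shared memory, rather than the L1 cache, as an explicitly managed store for preloaded data. A minimal CUDA sketch of that idea follows; the kernel, array names, and tile size are illustrative assumptions, not taken from the cited paper:

```cuda
#include <cuda_runtime.h>

#define TILE 256  // illustrative tile size, one element per thread

// Hypothetical kernel: each block stages a tile of read-only input into
// shared memory with a single coalesced load, so later reuse is served
// from software-managed on-chip storage instead of competing for L1.
__global__ void scaled_copy(const float *in, float *out, int n, float alpha) {
    __shared__ float tile[TILE];          // explicitly managed on-chip buffer

    int idx = blockIdx.x * TILE + threadIdx.x;

    if (idx < n)
        tile[threadIdx.x] = in[idx];      // preload: global -> shared
    __syncthreads();                      // tile now resident for the whole block

    if (idx < n)
        out[idx] = alpha * tile[threadIdx.x];  // reuse hits shared memory, not L1
}
```

Because shared memory is allocated per block and never evicted, data placed there is immune to the L1 thrashing the citing papers discuss, at the cost of explicit staging and synchronization.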
“…As many previous research studies have shown, effectively hiding cache resource contention is a crucial step to achieving high performance on GPUs. 6,31,[33][34][35][36][37][38][39][40][41]43 Previous studies of resolving the resource contention problems are based on dynamic analysis methods that require hardware modification. In addition to preloading in shared memory efficiently, it is necessary to combine static analysis to avoid the L1 cache from the resource contentions effectively.…”
Section: Impact of Various Preload Factors
confidence: 99%