Proceedings of the 46th International Symposium on Computer Architecture 2019
DOI: 10.1145/3307650.3322235
Adaptive memory-side last-level GPU caching

Abstract: Emerging GPU applications exhibit increasingly high computation demands which has led GPU manufacturers to build GPUs with an increasingly large number of streaming multiprocessors (SMs). Providing data to the SMs at high bandwidth puts significant pressure on the memory hierarchy and the Network-on-Chip (NoC). Current GPUs typically partition the memory-side last-level cache (LLC) in equally-sized slices that are shared by all SMs. Although a shared LLC typically results in a lower miss rate, we find that for…
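The shared memory-side LLC described in the abstract implies a fixed address-to-slice mapping: every SM that misses on a given line sends its request to that line's single home slice. Below is a minimal sketch of such a mapping, assuming simple line interleaving; the line size and slice count are illustrative assumptions, not values from the paper:

```python
LINE_SIZE = 128    # bytes per cache line (assumed for illustration)
NUM_SLICES = 16    # number of equally-sized LLC slices (assumed)

def llc_slice(address: int) -> int:
    """Home LLC slice for a physical address under line interleaving.

    All SMs map a given line to the same slice, so each line has exactly
    one home: the shared-LLC organization the abstract describes.
    """
    line = address // LINE_SIZE
    return line % NUM_SLICES
```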

Cited by 15 publications (23 citation statements) · References 79 publications
“…We observe almost no performance improvement for the shared-friendly applications compared to the baseline. This is because performance is limited by the L2 reply bandwidth bottleneck [49,73,74]. Such a bottleneck is relieved with Shared++ and DynEB as the shared L1 organization utilizes the remote cores as an additional source of bandwidth.…”
Section: Sensitivity Studies
confidence: 99%
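The reply-bandwidth argument in the statement above can be made concrete with a back-of-envelope model: with private L1s the L2 is the only data source, while a shared L1 organization adds remote L1s as a second source. The numbers below are illustrative assumptions, not measurements from any of the cited papers:

```python
# Normalized bandwidths; both values are assumptions for illustration.
l2_reply_bw = 1.0    # L2 -> L1 reply bandwidth (the bottleneck above)
remote_l1_bw = 0.5   # extra supply from hits in remote cores' L1s

private_supply = l2_reply_bw                  # private L1s: L2 is the sole source
shared_supply = l2_reply_bw + remote_l1_bw    # shared L1s: remote cores chip in

print(f"relative data supply: {shared_supply / private_supply:.2f}x")
```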
“…Our shared L1 organization utilizes inter-core communication to eliminate the L1 cache wastage without the need for searching or prediction. Zhao et al [73] boost performance of applications with high degrees of data sharing between cores by replicating the shared cache lines across different L2 slices. This is complementary to our work as ours improves the L1 bandwidth utilization while their work improves the L2 bandwidth.…”
Section: Related Work
confidence: 99%
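A hedged sketch of the replication idea attributed to Zhao et al. [73]: heavily shared lines are copied into several L2 slices so a request can be served by a nearby replica, trading LLC capacity for bandwidth. The replica placement and distance model below are assumptions for illustration, not the mechanism from the paper:

```python
NUM_SLICES = 16
LINE_SIZE = 128
REPLICATION_DEGREE = 4   # replicas per shared line (assumed)

def home_slice(address: int) -> int:
    return (address // LINE_SIZE) % NUM_SLICES

def replica_slices(address: int) -> list[int]:
    """Slices holding copies of a replicated line: home plus strided copies."""
    stride = NUM_SLICES // REPLICATION_DEGREE
    home = home_slice(address)
    return [(home + i * stride) % NUM_SLICES for i in range(REPLICATION_DEGREE)]

def serving_slice(address: int, sm_id: int, is_shared: bool) -> int:
    """Slice that serves this request.

    Private lines go to their single home slice; shared lines are served
    by the replica nearest the requesting SM (distance is modeled crudely
    as index difference modulo the slice count).
    """
    if not is_shared:
        return home_slice(address)
    return min(replica_slices(address), key=lambda s: abs(s - sm_id % NUM_SLICES))
```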
“…of data sets for our replication-sensitive applications. More specifically, SelRep LLC improves performance by 19.7% and 11.1% on average (and up to 61.6% and 31.0%) compared to the baseline shared LLC organization and state-of-the-art Adaptive LLC [75], respectively. In summary, we make the following major contributions:…”
Section: Introduction
confidence: 97%
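As a quick sanity check on these averages (assuming the reported means compose multiplicatively, which is only approximate for averages of per-application ratios), the two numbers imply that Adaptive LLC [75] gains roughly 7.7% over the baseline:

```python
selrep_vs_baseline = 1.197   # SelRep LLC speedup over the shared baseline
selrep_vs_adaptive = 1.111   # SelRep LLC speedup over Adaptive LLC [75]

# Implied Adaptive-LLC-over-baseline speedup, assuming the ratios compose.
adaptive_vs_baseline = selrep_vs_baseline / selrep_vs_adaptive
print(f"Adaptive LLC vs. baseline: {adaptive_vs_baseline:.3f}x (~7.7%)")
```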
“…Unfortunately, this significantly reduces the effective LLC capacity as copies of shared data are now stored in multiple LLC slices. The state-of-the-art Adaptive LLC [75] dynamically selects either a shared or private organization based on application behavior. Unfortunately, this all-or-nothing approach only addresses the LLC serialization problem when the shared data set is small enough to fit in the LLC with maximum replication.…”
Section: Introduction
confidence: 99%
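The all-or-nothing behavior described above can be sketched as a simple capacity test: choose the private (fully replicated) organization only when the shared data set still fits once every slice holds a copy. The threshold below is an assumption for illustration, not the exact decision mechanism of Adaptive LLC [75]:

```python
SLICE_CAPACITY_BYTES = 512 * 1024   # capacity of one LLC slice (assumed)

def choose_llc_mode(shared_set_bytes: int) -> str:
    """Return 'private' when maximum replication still fits, else 'shared'.

    Under the private organization every slice keeps its own copy of the
    shared data, so the effective capacity for that data shrinks to a
    single slice; past that point replication stops paying off.
    """
    if shared_set_bytes <= SLICE_CAPACITY_BYTES:
        return "private"   # replication fits: serve shared data locally
    return "shared"        # too large to replicate: one copy per line
```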