Proceedings of the 46th International Symposium on Computer Architecture 2019
DOI: 10.1145/3307650.3322235
Adaptive memory-side last-level GPU caching

Abstract: Emerging GPU applications exhibit increasingly high computation demands which has led GPU manufacturers to build GPUs with an increasingly large number of streaming multiprocessors (SMs). Providing data to the SMs at high bandwidth puts significant pressure on the memory hierarchy and the Network-on-Chip (NoC). Current GPUs typically partition the memory-side last-level cache (LLC) in equally-sized slices that are shared by all SMs. Although a shared LLC typically results in a lower miss rate, we find that for…
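The shared memory-side LLC described in the abstract implies a fixed address-to-slice mapping: every SM that misses on a given line sends its request to that line's single home slice. Below is a minimal sketch of such a mapping, assuming simple line interleaving; the line size and slice count are illustrative assumptions, not values from the paper:

```python
LINE_SIZE = 128    # bytes per cache line (assumed for illustration)
NUM_SLICES = 16    # number of equally-sized LLC slices (assumed)

def llc_slice(address: int) -> int:
    """Home LLC slice for a physical address under line interleaving.

    All SMs map a given line to the same slice, so each line has exactly
    one home: the shared-LLC organization the abstract describes.
    """
    line = address // LINE_SIZE
    return line % NUM_SLICES
```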

Cited by 15 publications (23 citation statements) · References 79 publications
“…We observe almost no performance improvement for the shared-friendly applications compared to the baseline. This is because performance is limited by the L2 reply bandwidth bottleneck [49,73,74]. Such a bottleneck is relieved with Shared++ and DynEB as the shared L1 organization utilizes the remote cores as an additional source of bandwidth.…”
Section: Sensitivity Studies
confidence: 99%
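The reply-bandwidth argument in the statement above can be made concrete with a back-of-envelope model: with private L1s the L2 is the only data source, while a shared L1 organization adds remote L1s as a second source. The numbers below are illustrative assumptions, not measurements from any of the cited papers:

```python
# Normalized bandwidths; both values are assumptions for illustration.
l2_reply_bw = 1.0    # L2 -> L1 reply bandwidth (the bottleneck above)
remote_l1_bw = 0.5   # extra supply from hits in remote cores' L1s

private_supply = l2_reply_bw                  # private L1s: L2 is the sole source
shared_supply = l2_reply_bw + remote_l1_bw    # shared L1s: remote cores chip in

print(f"relative data supply: {shared_supply / private_supply:.2f}x")
```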
“…Our shared L1 organization utilizes inter-core communication to eliminate the L1 cache wastage without the need for searching or prediction. Zhao et al [73] boost performance of applications with high degrees of data sharing between cores by replicating the shared cache lines across different L2 slices. This is complementary to our work as ours improves the L1 bandwidth utilization while their work improves the L2 bandwidth.…”
Section: Related Work
confidence: 99%
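A hedged sketch of the replication idea attributed to Zhao et al. [73]: heavily shared lines are copied into several L2 slices so a request can be served by a nearby replica, trading LLC capacity for bandwidth. The replica placement and distance model below are assumptions for illustration, not the mechanism from the paper:

```python
NUM_SLICES = 16
LINE_SIZE = 128
REPLICATION_DEGREE = 4   # replicas per shared line (assumed)

def home_slice(address: int) -> int:
    return (address // LINE_SIZE) % NUM_SLICES

def replica_slices(address: int) -> list[int]:
    """Slices holding copies of a replicated line: home plus strided copies."""
    stride = NUM_SLICES // REPLICATION_DEGREE
    home = home_slice(address)
    return [(home + i * stride) % NUM_SLICES for i in range(REPLICATION_DEGREE)]

def serving_slice(address: int, sm_id: int, is_shared: bool) -> int:
    """Slice that serves this request.

    Private lines go to their single home slice; shared lines are served
    by the replica nearest the requesting SM (distance is modeled crudely
    as index difference modulo the slice count).
    """
    if not is_shared:
        return home_slice(address)
    return min(replica_slices(address), key=lambda s: abs(s - sm_id % NUM_SLICES))
```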
“…of data sets for our replication-sensitive applications. More specifically, SelRep LLC improves performance by 19.7% and 11.1% on average (and up to 61.6% and 31.0%) compared to the baseline shared LLC organization and state-of-the-art Adaptive LLC [75], respectively. In summary, we make the following major contributions:…”
Section: Introduction
confidence: 97%
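As a quick sanity check on these averages (assuming the reported means compose multiplicatively, which is only approximate for averages of per-application ratios), the two numbers imply that Adaptive LLC [75] gains roughly 7.7% over the baseline:

```python
selrep_vs_baseline = 1.197   # SelRep LLC speedup over the shared baseline
selrep_vs_adaptive = 1.111   # SelRep LLC speedup over Adaptive LLC [75]

# Implied Adaptive-LLC-over-baseline speedup, assuming the ratios compose.
adaptive_vs_baseline = selrep_vs_baseline / selrep_vs_adaptive
print(f"Adaptive LLC vs. baseline: {adaptive_vs_baseline:.3f}x (~7.7%)")
```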
“…Unfortunately, this significantly reduces the effective LLC capacity as copies of shared data are now stored in multiple LLC slices. The state-of-the-art Adaptive LLC [75] dynamically selects either a shared or private organization based on application behavior. Unfortunately, this all-or-nothing approach only addresses the LLC serialization problem when the shared data set is small enough to fit in the LLC with maximum replication.…”
Section: Introduction
confidence: 99%
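The all-or-nothing behavior described above can be sketched as a simple capacity test: choose the private (fully replicated) organization only when the shared data set still fits once every slice holds a copy. The threshold below is an assumption for illustration, not the exact decision mechanism of Adaptive LLC [75]:

```python
SLICE_CAPACITY_BYTES = 512 * 1024   # capacity of one LLC slice (assumed)

def choose_llc_mode(shared_set_bytes: int) -> str:
    """Return 'private' when maximum replication still fits, else 'shared'.

    Under the private organization every slice keeps its own copy of the
    shared data, so the effective capacity for that data shrinks to a
    single slice; past that point replication stops paying off.
    """
    if shared_set_bytes <= SLICE_CAPACITY_BYTES:
        return "private"   # replication fits: serve shared data locally
    return "shared"        # too large to replicate: one copy per line
```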