2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
DOI: 10.1109/ispass.2013.6557152

Evaluating cache coherent shared virtual memory for heterogeneous multicore chips

Abstract: The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current HMC. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this H…
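
The abstract stops short of showing the xthreads interface, but the shift it describes, CPU and MTTOP threads operating on one coherent virtual address space instead of exchanging explicit copies, can be approximated with today's CUDA unified memory. The sketch below is only an analogue of that idea, not the paper's xthreads API: cudaMallocManaged and the kernel-launch syntax are standard CUDA, while the scale kernel and the buffer size are made up for illustration.

// Sketch: CPU and GPU touching the same virtual addresses, in the spirit of
// CCSVM, using CUDA unified (managed) memory as a present-day analogue.
// This is NOT the paper's xthreads API.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;                 // GPU writes through the shared pointer
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));  // one allocation, visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // CPU initializes in place, no staging buffer
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                      // wait for the GPU ...
    printf("data[0] = %f\n", data[0]);            // ... then the CPU reads the result directly
    cudaFree(data);
    return 0;
}

The property to notice is that no cudaMemcpy appears anywhere: the same pointer is produced on one side and consumed on the other, which is the kind of interaction a CCSVM design is meant to make cheap in hardware.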

Cited by 15 publications (4 citation statements)
References 34 publications

“…In case of embedded platforms with shared system DRAM, using the CE basically means duplicating the same buffer twice on the same memory device. Both CUDA and OpenCL programming models specify alternatives to the CE approach to avoid explicit memory transfers and unnecessary buffer replications, such as CUDA UVM (Unified Virtual Memory [14]) and OpenCL 2.0 SVM (Shared Virtual Memory [15]). However, these approaches introduce CPU-iGPU memory coherency problems when accessing the same shared memory buffer, so that avoiding copy engines does not necessarily lead to performance improvements. For this reason, we will characterize the contention originated in both CE- and non-CE-based models.…”
Section: SoCs Specifications and Contention Points
Confidence: 99%
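
For contrast with the UVM/SVM alternatives the quote mentions, here is a minimal sketch of the copy-engine (CE) style path it refers to: a separate device allocation plus explicit transfers. The kernel body and buffer size are placeholders, not taken from the cited work; on an integrated-GPU SoC with a single shared DRAM, h_buf and d_buf end up as two copies of the same data in the same physical memory.

// Sketch of the copy-engine (CE) path: a distinct device allocation plus
// explicit transfers. On an embedded SoC where CPU and iGPU share one DRAM,
// h_buf and d_buf hold the same data twice in the same memory device.
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void add_one(float *buf, int n) {      // placeholder GPU work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

void run_ce_style(int n) {
    size_t bytes = n * sizeof(float);
    float *h_buf = (float *)calloc(n, sizeof(float));  // host copy
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);                         // device copy of the same buffer

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   // CE transfer in
    add_one<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);   // CE transfer out

    cudaFree(d_buf);
    free(h_buf);
}

Replacing cudaMalloc and the two cudaMemcpy calls with cudaMallocManaged (or an OpenCL 2.0 SVM buffer) removes the duplicate allocation and the transfers, but, as the quote notes, it shifts the cost to keeping the CPU and iGPU views of that single buffer coherent.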
“…This is especially problematic for pointer-based data structures (e.g., linked lists, trees). Recent work tries to address this using various smarter memory management schemes [20,21,25,26]. Furthermore, latest CUDA releases permit limited CPU/GPU virtual address sharing [57].…”
Section: Address Translation on CPU/GPUs
Confidence: 99%
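
The difficulty this quote raises with pointer-based structures is that embedded pointers are only meaningful in the address space where they were created. As an illustration of that general problem (not code from the cited paper), the sketch below builds a small linked list in CUDA managed memory so the very same node pointers remain valid when a GPU kernel walks the list; without shared virtual addressing, each node would have to be deep-copied and its next pointers rewritten.

// A CPU-built linked list walked by a GPU kernel. With managed memory the
// 'next' pointers the CPU stored are valid on the GPU as well; with separate
// address spaces every node would need a deep copy and pointer fix-up.
#include <cstdio>
#include <cuda_runtime.h>

struct Node { int value; Node *next; };

__global__ void sum_list(const Node *head, int *out) {
    int s = 0;
    for (const Node *n = head; n != nullptr; n = n->next)   // same pointers the CPU wrote
        s += n->value;
    *out = s;
}

int main() {
    Node *head = nullptr;
    for (int v = 1; v <= 4; ++v) {                // CPU builds the list node by node
        Node *n = nullptr;
        cudaMallocManaged(&n, sizeof(Node));
        n->value = v;
        n->next = head;
        head = n;
    }
    int *sum = nullptr;
    cudaMallocManaged(&sum, sizeof(int));
    sum_list<<<1, 1>>>(head, sum);
    cudaDeviceSynchronize();
    printf("sum = %d\n", *sum);                   // prints 10
    return 0;
}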
“…There have been many coherence extensions proposed over the years (discussed further in Section II), but these generally build upon conventional hardware protocols originally designed for CPUs such as MESI. Such protocols are effective for a wide range of CPU workloads, but these complex coherence strategies often incur unacceptable overheads for accelerators such as GPUs [34], [33], [75]. In addition, the complexity of MESI-based protocols makes validating protocol changes expensive, requiring that the cost of any coherence extension be amortized over a broad range of general-purpose applications.…”
Section: Introduction
Confidence: 99%
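
As a rough reference for readers unfamiliar with the MESI family this quote refers to, the following is a textbook-level sketch of the stable per-cache-line states such a protocol tracks and how they change on local and remote accesses. It is host-side C++ (it compiles with the same CUDA toolchain as the earlier sketches) and deliberately omits the transient states, races, and snoop/directory machinery that make real MESI-based protocols as hard to extend as the quote argues.

// Textbook simplification of the stable per-cache-line states in a MESI
// protocol. A real protocol also handles transient states, write-backs,
// and snoop/directory traffic, which is where the complexity comes from.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

// State after this core reads the line; 'othersHaveCopy' stands in for the
// snoop or directory response.
Mesi onLocalRead(Mesi s, bool othersHaveCopy) {
    if (s == Mesi::Invalid)
        return othersHaveCopy ? Mesi::Shared : Mesi::Exclusive;
    return s;                                   // M, E, S already satisfy the read
}

// State after this core writes the line (other copies are invalidated).
Mesi onLocalWrite(Mesi /*s*/) { return Mesi::Modified; }

// State after another core reads the line (a Modified copy is written back,
// then shared).
Mesi onRemoteRead(Mesi s) {
    return (s == Mesi::Invalid) ? Mesi::Invalid : Mesi::Shared;
}

// State after another core writes the line: our copy is no longer valid.
Mesi onRemoteWrite(Mesi /*s*/) { return Mesi::Invalid; }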