Proceedings of the 48th International Symposium on Microarchitecture 2015
DOI: 10.1145/2830772.2830778

Efficiently enforcing strong memory ordering in GPUs

Abstract: GPU programming models such as CUDA and OpenCL are starting to adopt a weaker data-race-free (DRF-0) memory model, which does not guarantee any semantics for programs with data-races. Before standardizing the memory model interface for GPUs, it is imperative that we understand the tradeoffs of different memory models for these devices. While there is a rich memory model literature for CPUs, studies on architectural mechanisms and performance costs for enforcing memory ordering constraints in GPU accelerators h…
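A minimal sketch of what the DRF-0 contract means in practice for a CUDA program (this is not code from the paper; the producer/consumer kernels, the data and flag names, and the use of libcu++'s cuda::atomic_ref, which postdates this work, are all illustrative assumptions). The flag is touched only through default, sequentially consistent atomic operations, so the program is data-race-free and DRF-0 guarantees the consumer sees the payload once it sees the flag:

#include <cuda/atomic>

// Data-race-free handoff: the flag is accessed only via seq_cst atomics.
__global__ void producer(int *data, int *flag) {
    *data = 42;                                                // payload write
    cuda::atomic_ref<int, cuda::thread_scope_device> f(*flag);
    f.store(1);                                                // seq_cst publish
}

__global__ void consumer(const int *data, int *flag, int *out) {
    cuda::atomic_ref<int, cuda::thread_scope_device> f(*flag);
    while (f.load() == 0) { }                                  // seq_cst wait
    *out = *data;                                              // once flag == 1 is observed, data == 42 is too
}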

Cited by 15 publications (12 citation statements)
References 46 publications
“…For example, in a GPU with small number of CUs, an inclusive directory at GL2 to keep track of sharers will not incur a significant overhead. Also, timestamp coherence can be used for reducing coherence traffic overhead [73] and private-shared memory access classification [71,72] can be used for reducing mutex requirements but we leave these explorations to future work. LSC and Previous GPU SC Implementations: Singh et al [71] proposed efficient SC implementation for GPUs by extending the work of Singh et al [72] for CPUs.…”
Section: Discussion
confidence: 99%
“…Also, timestamp coherence can be used for reducing coherence traffic overhead [73] and private-shared memory access classification [71,72] can be used for reducing mutex requirements but we leave these explorations to future work. LSC and Previous GPU SC Implementations: Singh et al [71] proposed efficient SC implementation for GPUs by extending the work of Singh et al [72] for CPUs. This work implemented SC for wavefront instructions (warp instructions) and argued that SC ordering need not be preserved across per-work-item (per-thread) instructions that execute in lockstep fashion.…”
Section: Discussion
confidence: 99%
“…In addition, previous work attempts to improve the performance and programmability of GPUs by supporting transactional memory [10,11,15,16,37,45] and by providing memory consistency and memory coherence on GPUs [5,19,36,38-40].…”
Section: GPU Solutions
confidence: 99%
“…Memory consistency models have not been formally defined on GPUs [19]. Until recently, Heterogeneous System Architecture (HSA) Foundation [20] and OpenCL [21] start to adopt the C11's data-race-free-0 (DRF-0) model, which guarantees sequential consistency (SC) for data-race-free code, but is undefined for the cases with data-races.…”
Section: GPU Architecture and Programming Model
confidence: 99%
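As a hypothetical counterpart to the race-free handoff sketched after the abstract above, the same pattern written with plain (non-atomic) flag accesses contains a data race, and under DRF-0 the program has no defined semantics at all:

// Racy handoff: plain accesses to the flag race with each other.
__global__ void producer_racy(int *data, int *flag) {
    *data = 42;              // payload write
    *flag = 1;               // plain store: races with the consumer's read
}

__global__ void consumer_racy(const int *data, volatile int *flag, int *out) {
    while (*flag == 0) { }   // volatile keeps the spin alive, but this is still a data race
    *out = *data;            // DRF-0 gives no guarantee of observing 42
}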