A Formal Analysis of the NVIDIA PTX Memory Consistency Model

Lustig, Daniel; Sahasrabuddhe, Sameer D.; Giroux, Olivier

doi:10.1145/3297858.3304043

Cited by 31 publications

(14 citation statements)

References 33 publications


“…Specifically, if there are two or more stores to overlapping locations from a lockstep execution, a determination as to which one was the last store to that location is not possible and hence the value of the subsequent load to that location is undefined. This is similar to the reasoning provided in prior works where the outcome of racey accesses that "occur at the same time" are undefined [5,23,36,46,55]. As a result, LSC does not impose any restriction on the value of overlapping stores from a lockstep execution.…”

Section: Hardware Design Implications (supporting)

confidence: 64%

“…HRF defines scopes in terms of the execution hierarchy of GPUs. For example, work-items within the same work-group (threadblock) synchronize through work-group scope, and work-items from different work-groups synchronize through device scope (scopes are present in other models as well [54]). While use of such scopes are well defined for synchronizing between work-items of a GPU, the synchronization between work-items and threads running on other processing elements on the same GPU is not clearly defined.…”

Section: Limitations of HRF (mentioning)

confidence: 99%

Systems-on-Chip with Strong Ordering

Puthoor, Lipasti. 2021. ACM Trans. Archit. Code Optim.

Sequential consistency (SC) is the most intuitive memory consistency model and the easiest for programmers and hardware designers to reason about. However, the strict memory ordering restrictions imposed by SC make it less attractive from a performance standpoint. Additionally, prior high-performance SC implementations required complex hardware structures to support speculation and recovery. In this article, we introduce the lockstep SC consistency model (LSC), a new memory model based on SC but carefully defined to accommodate the data parallel lockstep execution paradigm of GPUs. We also describe an efficient LSC implementation for an APU system-on-chip (SoC) and show that our implementation performs close to the baseline relaxed model. Evaluation of our implementation shows that the geometric mean performance cost for lockstep SC is just 0.76% for GPU execution and 6.11% for the entire APU SoC compared to a baseline with a weaker memory consistency model. Adoption of LSC in future APU and SoC designs will reduce the burden on programmers trying to write correct parallel programs, while also simplifying the implementation and verification of systems with heterogeneous processing elements and complex memory hierarchies.

“…Memory Modelling. CPU memory models such as x86 [Owens et al 2009], POWER, Arm [Pulte et al 2017], and RISC-V [Pulte et al 2019] are now fairly well understood, as are some GPU memory models [Alglave et al 2015; Lustig et al 2019]. However, these models do not apply to systems where threads are on different devices.…”

Section: Further Related Work (mentioning)

confidence: 99%

The semantics of shared memory in Intel CPU/FPGA systems

Iorga, Donaldson, Sorensen, et al. 2021. Proc. ACM Program. Lang.

Heterogeneous CPU/FPGA devices, in which a CPU and an FPGA can execute together while sharing memory, are becoming popular in several computing sectors. In this paper, we study the shared-memory semantics of these devices, with a view to providing a firm foundation for reasoning about the programs that run on them. Our focus is on Intel platforms that combine an Intel FPGA with a multicore Xeon CPU. We describe the weak-memory behaviours that are allowed (and observable) on these devices when CPU threads and an FPGA thread access common memory locations in a fine-grained manner through multiple channels. Some of these behaviours are familiar from well-studied CPU and GPU concurrency; others are weaker still. We encode these behaviours in two formal memory models: one operational, one axiomatic. We develop executable implementations of both models, using the CBMC bounded model-checking tool for our operational model and the Alloy modelling language for our axiomatic model. Using these, we cross-check our models against each other via a translator that converts Alloy-generated executions into queries for the CBMC model. We also validate our models against actual hardware by translating 583 Alloy-generated executions into litmus tests that we run on CPU/FPGA devices; when doing this, we avoid the prohibitive cost of synthesising a hardware design per litmus test by creating our own 'litmus-test processor' in hardware. We expect that our models will be useful for low-level programmers, compiler writers, and designers of analysis tools. Indeed, as a demonstration of the utility of our work, we use our operational model to reason about a producer/consumer buffer implemented across the CPU and the FPGA. When the buffer uses insufficient synchronisation -- a situation that our model is able to detect -- we observe that its performance improves at the cost of occasional data corruption.

“…Modelling the concurrency aspects of the Armv8 architecture entails developing a consistency model for Armv8. Consistency models determine what values a read can take; weak consistency models such as the ones of Arm [4,23], IBM [37,38], Intel [39,40], Nvidia [10,33], RISC-V [3], C++ [20,31], Linux [15], and others allow more behaviours than Sequential Consistency (SC) [32].…”

Section: Design Principles and Rationale (mentioning)

confidence: 99%

Armed Cats

Alglave, Deacon, Grisenthwaite, et al. 2021. ACM Trans. Program. Lang. Syst.

We report on the process for formal concurrency modelling at Arm. An initial formal consistency model of the Arm architecture, written in the cat language, was published and upstreamed to the herd+diy tool suite in 2017. Since then, we have extended the original model with extra features, for example, mixed-size accesses, and produced two provably equivalent alternative formulations. In this article, we present a comprehensive review of work done at Arm on the consistency model. Along the way, we also show that our principle for handling mixed-size accesses applies to x86: We confirm this via vast experimental campaigns. We also show that our alternative formulations are applicable to any model phrased in a style similar to the one chosen by Arm.
