Proceedings of the 48th International Symposium on Microarchitecture 2015
DOI: 10.1145/2830772.2830778

Efficiently enforcing strong memory ordering in GPUs

Abstract: GPU programming models such as CUDA and OpenCL are starting to adopt a weaker data-race-free (DRF-0) memory model, which does not guarantee any semantics for programs with data-races. Before standardizing the memory model interface for GPUs, it is imperative that we understand the tradeoffs of different memory models for these devices. While there is a rich memory model literature for CPUs, studies on architectural mechanisms and performance costs for enforcing memory ordering constraints in GPU accelerators h…
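A minimal sketch of what the DRF-0 contract means in practice for a CUDA program (this is not code from the paper; the producer/consumer kernels, the data and flag names, and the use of libcu++'s cuda::atomic_ref, which postdates this work, are all illustrative assumptions). The flag is touched only through default, sequentially consistent atomic operations, so the program is data-race-free and DRF-0 guarantees the consumer sees the payload once it sees the flag:

#include <cuda/atomic>

// Data-race-free handoff: the flag is accessed only via seq_cst atomics.
__global__ void producer(int *data, int *flag) {
    *data = 42;                                                // payload write
    cuda::atomic_ref<int, cuda::thread_scope_device> f(*flag);
    f.store(1);                                                // seq_cst publish
}

__global__ void consumer(const int *data, int *flag, int *out) {
    cuda::atomic_ref<int, cuda::thread_scope_device> f(*flag);
    while (f.load() == 0) { }                                  // seq_cst wait
    *out = *data;                                              // once flag == 1 is observed, data == 42 is too
}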

Cited by 15 publications (12 citation statements)
References 46 publications
“…For example, in a GPU with small number of CUs, an inclusive directory at GL2 to keep track of sharers will not incur a significant overhead. Also, timestamp coherence can be used for reducing coherence traffic overhead [73] and private-shared memory access classification [71,72] can be used for reducing mutex requirements but we leave these explorations to future work. LSC and Previous GPU SC Implementations: Singh et al [71] proposed efficient SC implementation for GPUs by extending the work of Singh et al [72] for CPUs.…”
Section: Discussion
confidence: 99%
“…Also, timestamp coherence can be used for reducing coherence traffic overhead [73] and private-shared memory access classification [71,72] can be used for reducing mutex requirements but we leave these explorations to future work. LSC and Previous GPU SC Implementations: Singh et al [71] proposed efficient SC implementation for GPUs by extending the work of Singh et al [72] for CPUs. This work implemented SC for wavefront instructions (warp instructions) and argued that SC ordering need not be preserved across per-work-item (per-thread) instructions that execute in lockstep fashion.…”
Section: Discussion
confidence: 99%
“…In addition, previous work attempts to improve the performance and programmability of GPUs by supporting transactional memory [10,11,15,16,37,45] and by providing memory consistency and memory coherence on GPUs [5,19,36,38-40].…”
Section: GPU Solutions
confidence: 99%
“…Memory consistency models have not been formally defined on GPUs [19]. Until recently, Heterogeneous System Architecture (HSA) Foundation [20] and OpenCL [21] start to adopt the C11's data-race-free-0 (DRF-0) model, which guarantees sequential consistency (SC) for data-race-free code, but is undefined for the cases with data-races.…”
Section: GPU Architecture and Programming Model
confidence: 99%
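As a hypothetical counterpart to the race-free handoff sketched after the abstract above, the same pattern written with plain (non-atomic) flag accesses contains a data race, and under DRF-0 the program has no defined semantics at all:

// Racy handoff: plain accesses to the flag race with each other.
__global__ void producer_racy(int *data, int *flag) {
    *data = 42;              // payload write
    *flag = 1;               // plain store: races with the consumer's read
}

__global__ void consumer_racy(const int *data, volatile int *flag, int *out) {
    while (*flag == 0) { }   // volatile keeps the spin alive, but this is still a data race
    *out = *data;            // DRF-0 gives no guarantee of observing 42
}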