2011 IEEE International Conference on Cluster Computing
DOI: 10.1109/cluster.2011.34

Performance Characterization and Optimization of Atomic Operations on AMD GPUs

Abstract: Atomic operations are important building blocks in supporting general-purpose computing on graphics processing units (GPUs). For instance, they can be used to coordinate execution between concurrent threads, and in turn, assist in constructing complex data structures such as hash tables or implementing GPU-wide barrier synchronization. While the performance of atomic operations has improved substantially on the latest NVIDIA Fermi-based GPUs, system-provided atomic operations still incur significant pe…

Cited by 23 publications (7 citation statements). References: 11 publications.
“…The current OpenCL compiler maps all kernel data into a single unordered access view. Consequently, including a single atomic operation in a kernel may force all memory loads and stores to follow the CompletePath instead of the FastPath, which can in turn cause severe performance degradation of an application as discovered by our previous study [8]. Note that atomic operations on variables stored in the local memory do not impact the selection of memory path.…”
Section: B Memory Paths (mentioning, confidence: 94%)
“…As discussed previously and documented by Elteir et al. [7], global operations are prohibitively expensive on AMD hardware. It may be viable on hardware from other vendors or future generations of GPUs.…”
Section: Load Balancing On Multiple Compute Units (mentioning, confidence: 94%)
“…In the preceding discussion, we excluded the copy overhead and kernel launch overhead for any of the GPU configurations; we report kernel execution only (footnote 7). Our reference graph implementation adds some additional overhead outside the mark phase.…”
Section: Overheads Of Our Implementation (mentioning, confidence: 99%)
“…In ARM and x86 architectures, generic atomic instructions incur substantial overhead because of their consistency and ILP restrictions [28,44]. Moreover, AMD and NVIDIA GPU architectures contain this overhead [10,35]. To ascertain the extent of the overhead in atomic instructions, similar to Nai's evaluation [28], we conducted a real machine experiment on an Intel Xeon E5-2620 using graph-processing kernels.…”
Section: Preventing the Overhead Of Host Atomic Instructions (mentioning, confidence: 99%)