Performance Characterization and Optimization of Atomic Operations on AMD GPUs

Elteir, Marwa K.; Lin, Heshan; Feng, Wu-chun

doi:10.1109/cluster.2011.34

Cited by 23 publications

(7 citation statements)

References 11 publications

“…The current OpenCL compiler maps all kernel data into a single unordered access view. Consequently, including a single atomic operation in a kernel may force all memory loads and stores to follow the CompletePath instead of the FastPath, which can in turn cause severe performance degradation of an application as discovered by our previous study [8]. Note that atomic operations on variables stored in the local memory does not impact the selection of memory path.…”

Section: B Memory Paths (mentioning)

confidence: 94%

StreamMR: An Optimized MapReduce Framework for AMD GPUs

Elteir, Lin, Feng, et al. 2011

2011 IEEE 17th International Conference on Parallel and Distributed Systems

Self Cite

Abstract: MapReduce is a programming model from Google that facilitates parallel processing on a cluster of thousands of commodity computers. The success of MapReduce in cluster environments has motivated several studies of implementing MapReduce on a graphics processing unit (GPU), but generally focusing on the NVIDIA GPU. Our investigation reveals that the design and mapping of the MapReduce framework needs to be revisited for AMD GPUs due to their notable architectural differences from NVIDIA GPUs. For instance, current state-of-the-art MapReduce implementations employ atomic operations to coordinate the execution of different threads. However, atomic operations can implicitly cause inefficient memory access, and in turn, severely impact performance. In this paper, we propose StreamMR, an OpenCL MapReduce framework optimized for AMD GPUs. With efficient atomic-free algorithms for output handling and intermediate result shuffling, StreamMR is superior to atomic-based MapReduce designs and can outperform existing atomic-free MapReduce implementations by nearly five-fold on an AMD Radeon HD 5870.

“…As discussed previously and documented by Elteir et al [7], global operations are prohibitively expensive on AMD hardware. It may be viable on hardware from other vendors or future generations of GPUs.…”

Section: Load Balancing On Multiple Compute Units (mentioning)

confidence: 94%

“…In the preceding discussion, we excluded the copy overhead and kernel launch overhead for any of the GPU configurations; we report kernel execution only. Our reference graph implementation adds some additional overhead outside the mark phase.…”

Section: Overheads Of Our Implementation (mentioning)

confidence: 99%

GPUs as an opportunity for offloading garbage collection

Maas, Reames, Morlan, et al. 2012

Proceedings of the 2012 International Symposium on Memory Management

GPUs have become part of most commodity systems. Nonetheless, they are often underutilized when not executing graphics-intensive or special-purpose numerical computations, which are rare in consumer workloads. Emerging architectures, such as integrated CPU/GPU combinations, may create an opportunity to utilize these otherwise unused cycles for offloading traditional systems tasks. Garbage collection appears to be a particularly promising candidate for offloading, due to the popularity of managed languages on consumer devices. We investigate the challenges for offloading garbage collection to a GPU, by examining the performance trade-offs for the mark phase of a mark & sweep garbage collector. We present a theoretical analysis and an algorithm that demonstrates the feasibility of this approach. We also discuss a number of algorithmic design trade-offs required to leverage the strengths and capabilities of the GPU hardware. Our algorithm has been integrated into the Jikes RVM and we present promising performance results.

“…In ARM and x86 architectures, generic atomic instructions incur substantial overhead because of their consistency and ILP restrictions [28,44]. Moreover, AMD and NVIDIA GPU architectures contain this overhead [10,35]. To ascertain the extent of the overhead in atomic instructions, similar to Nai's evaluation [28], we conducted a real machine experiment on an Intel Xeon E5-2620 using graph-processing kernels.…”

Section: Preventing the Overhead Of Host Atomic Instructions (mentioning)

confidence: 99%

Cairo

Hadidi, Nai, Kim 2017

ACM Trans. Archit. Code Optim.

Three-dimensional (3D)-stacking technology and the memory-wall problem have popularized processing-in-memory (PIM) concepts again, which offers the benefits of bandwidth and energy savings by offloading computations to functional units inside the memory. Several memory vendors have also started to integrate computation logics into the memory, such as Hybrid Memory Cube (HMC), the latest version of which supports up to 18 in-memory atomic instructions. Although industry prototypes have motivated studies for investigating efficient methods and architectures for PIM, researchers have not proposed a systematic way for identifying the benefits of instruction-level PIM offloading. As a result, compiler support for recognizing offloading candidates and utilizing instruction-level PIM offloading is unavailable. In this article, we analyze the advantages of instruction-level PIM offloading in the context of HMC-atomic instructions for graph-computing applications and propose CAIRO, a compiler-assisted technique and decision model for enabling instruction-level offloading of PIM without any burden on programmers. To develop CAIRO, we analyzed how instruction offloading enables performance gain in both CPU and GPU workloads. Our studies show that performance gain from bandwidth savings, the ratio of number of cache misses to total cache accesses, and the overhead of host atomic instructions are the key factors in selecting an offloading candidate. Based on our analytical models, we characterize the properties of beneficial and nonbeneficial candidates for offloading. We evaluate CAIRO with 27 multithreaded CPU and 36 GPU benchmarks. In our evaluation, CAIRO not only doubles the speedup for a set of PIM-beneficial workloads by exploiting HMC-atomic instructions but also prevents slowdown caused by incorrect offloading decisions for other workloads.
CCS Concepts: • Hardware → 3D integrated circuits; Emerging architectures; Memory and dense storage; • Software and its engineering → Compilers;
