2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
DOI: 10.1109/ispass.2018.00034
Evaluating Performance Tradeoffs on the Radeon Open Compute Platform

Cited by 26 publications (8 citation statements)
References 15 publications
“…Cabezas et al [55] showed a software solution, including programming interfaces, compiler support and runtime, to partition GPU kernels for multi-GPU execution in a single node. Finally, Sun et al [56] evaluated the potential performance benefit and tradeoffs of AMD's Radeon Open Compute (ROC) platform for Heterogeneous System Architecture (HSA).…”
Section: Related Work
confidence: 99%
“…On the other hand, HIP is claimed to achieve excellent performance while still being compatible with Nvidia GPUs. Sun and coauthors (Sun et al, 2018) evaluated performance options of the ROCm platform using general CPU-GPU benchmarks and machine learning benchmarks. They found HIP to be the best-performing high-level framework for AMD devices and confirmed that HIP has close to zero overhead over CUDA on Nvidia GPUs, and thus provides both performance and portability.…”
Section: Related Work
confidence: 99%
“…Convolutional layers dominate the overall DNN training time. In particular, the convolutional layers alone can contribute to approximately 90% of the training time [21], [30]. Therefore, in this paper we are focused on improving GEMM-based convolutional layer performance, given its dominance on training performance.…”
Section: Characterization Of Sparse Matrix Operations
confidence: 99%
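The statement above refers to GEMM-based convolutional layers, where a convolution is lowered to a single matrix multiply via the standard im2col transformation. As a minimal illustrative sketch (not code from the cited papers; stride 1, no padding assumed), this mapping can be written in NumPy:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix
    so that convolution becomes a single GEMM (stride 1, no padding)."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w))
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # Each (ci, i, j) offset contributes one row: the value that
                # offset sees at every output position, flattened row-major.
                cols[row] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols

def conv2d_gemm(x, weights):
    """Convolve (C, H, W) input with (K, C, kh, kw) filters via one GEMM;
    returns a (K, out_h, out_w) output."""
    k, c, kh, kw = weights.shape
    out_h = x.shape[1] - kh + 1
    out_w = x.shape[2] - kw + 1
    w_mat = weights.reshape(k, c * kh * kw)   # (K, C*kh*kw)
    cols = im2col(x, kh, kw)                  # (C*kh*kw, out_h*out_w)
    return (w_mat @ cols).reshape(k, out_h, out_w)  # the GEMM
```

Because the heavy lifting is a single dense matrix multiply, improving GEMM performance (as the cited work does) directly improves the convolutional layers that dominate training time.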