An investigation of Unified Memory Access performance in CUDA

Landaverde, Raphael; Zhang, Tiansheng; Coskun, Ayse K.; Herbordt, Martin C.

doi:10.1109/hpec.2014.7040988

Cited by 83 publications

(48 citation statements)

References 7 publications


“…Gelado et al [5] presented a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory. Nickolls et al [9] investigated the Unified Memory programming model and evaluate the performance. However, he only tested one benchmark suite and did not analyze the reason for the performance loss.…”

Section: Discussion (mentioning)

confidence: 99%

An Evaluation of Unified Memory Technology on NVIDIA GPUs

Jin, Cui, et al. 2015

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing


Unified Memory is an emerging technology which is supported by CUDA 6.X. Before CUDA 6.X, the existing CUDA programming model relies on programmers to explicitly manage data between CPU and GPU and hence increases programming complexity. CUDA 6.X provides a new technology which is called as Unified Memory to provide a new programming model that defines CPU and GPU memory space as a single coherent memory (imaging as a same common address space). The system manages data access between CPU and GPU without explicit memory copy functions. This paper is to evaluate the Unified Memory technology through different applications on different GPUs to show the users how to use the Unified Memory technology of CUDA 6.X efficiently. The applications include Diffusion3D Benchmark, Parboil Benchmark Suite, and Matrix Multiplication from the CUDA SDK Samples. We changed those applications to corresponding Unified Memory versions and compare those with the original ones. We selected the NVIDIA Kepler K40 and the Jetson TK1, which can represent the latest GPUs with Kepler architecture and the first mobile platform of NVIDIA series with Kepler GPU. This paper shows that Unified Memory versions cause 10% performance loss on average. Furthermore, we used the NVIDIA Visual Profiler to dig the reason of the performance loss by the Unified Memory technology.


“…Typically, solutions that increase flexibility and ease of programming impose a certain performance overhead. The authors of [14] thoroughly tested the UM mechanism. They incorporated several benchmarks, both those written by the authors but also the Rodinia benchmark set.…”

Section: Unified Memory (mentioning)

confidence: 99%

Performance evaluation of unified memory and dynamic parallelism for selected parallel CUDA applications

Jarząbek, Czarnul 2017

J Supercomput


The aim of this paper is to evaluate performance of new CUDA mechanisms, unified memory and dynamic parallelism, for real parallel applications compared to standard CUDA API versions. In order to gain insight into performance of these mechanisms, we decided to implement three applications with control and data flow typical of SPMD, geometric SPMD and divide-and-conquer schemes, which were then used for tests and experiments. Specifically, tested applications include verification of Goldbach's conjecture, 2D heat transfer simulation and adaptive numerical integration. We experimented with various ways of how dynamic parallelism can be deployed into an existing implementation and be optimized further. Subsequently, we compared the best dynamic parallelism and unified memory versions to respective standard API counterparts. It was shown that usage of dynamic parallelism resulted in improvement in performance for heat simulation, better than static but worse than an iterative version for numerical integration and finally worse results for Goldbach's conjecture verification. In most cases, unified memory results in decrease in performance. On the other hand, both mechanisms can contribute to simpler and more readable codes. For dynamic parallelism, it applies to algorithms in which it can be naturally applied. Unified memory generally makes it easier for a programmer to enter the CUDA programming paradigm as it resembles the traditional memory allocation/usage pattern.


“…This work was targeted at optimizing small message transfers and was further extended by Shi et al in [14] where the authors showed how some of the new techniques such as NIC loopback and Fastcopy could enable faster transfer of eager messages with higher performance. In a recent work done by Landaverde et al in [8], the authors have done a performance evaluation of the CUDA managed memory from an applications perspective. The authors state that even though the programming productivity is high due to the on-demand fetching of data, the performance of managed memory is poor which severely restricts its flexibility and adding future optimizations.…”

Section: Related Work (mentioning)

confidence: 99%

Designing high performance communication runtime for GPU managed memory

Banerjee, Hamidouche, Panda 2016

Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit


Graphics Processing Units (GPUs) have gained the position of a main stream accelerator due to its low power footprint and massive parallelism. CUDA 6.0 onward, NVIDIA has introduced the Managed Memory capability which unifies the host and device memory allocations into a single allocation and removes the requirement for explicit memory transfers between either memories. Several applications particularly of irregular nature can have immense benefits from managed memory because of the high productivity in programming that can be achieved owing to the minimal effort involved in the data management and movement. The MVAPICH2 library utilizes runtime designs such as CUDA Inter Process Communications (IPC) and GPUDirect RDMA (GDR) under the CUDA-Aware concept, to offer high productivity and programmability with MPI on modern clusters. However, integration and interaction of managed memory with these features raises challenges for efficient small and large message communications. In this study, we present an initial evaluation of managed memory capability and its interaction with existing high performance designs and features available in MVAPICH2 library. We propose new designs to enable efficient communication support between managed memory buffers. We also perform fine tuning to optimize the transfers between managed memories residing in GPUs. To the best of our knowledge, this is the first evaluation and study of managed memory and its interaction with MPI runtimes. A detailed evaluation and analysis of the performance of the proposed designs is presented. The Stencil2D communication kernel available in the SHOC suite was re-designed to enable the managed memory support. The evaluation shows a 4x improvement in the timings of stencil exchanges on 16 GPU nodes.
