2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems
DOI: 10.1109/hpcc.2012.69

DMA-Assisted, Intranode Communication in GPU Accelerated Systems

Abstract: Accelerator awareness has become a pressing issue in data movement models, such as MPI, because of the rapid deployment of systems that utilize accelerators. In our previous work, we developed techniques to enhance MPI with accelerator awareness, thus allowing applications to easily and efficiently communicate data between accelerator memories. In this paper, we extend this work with techniques to perform efficient data movement between accelerators within the same node using a DMA-assisted, peer-to-p…
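
The peer-to-peer transfer path described in the abstract can be illustrated with the public CUDA runtime API. The following is a minimal sketch of a DMA-assisted copy between two GPUs in one node, not the paper's MPI-internal implementation; the device numbers and buffer size are illustrative assumptions.

/* Minimal sketch of a peer-to-peer copy between two GPUs in the same node,
 * using the public CUDA runtime API.  Illustrative only; not the paper's
 * MPI-internal code.  Assumes the node has at least two GPUs. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

int main(void)
{
    const size_t nbytes = 1 << 20;   /* 1 MiB payload (illustrative) */
    int src_dev = 0, dst_dev = 1;
    void *src_buf, *dst_buf;

    /* Check whether the destination GPU can reach the source GPU directly. */
    int can_access = 0;
    CHECK(cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev));

    CHECK(cudaSetDevice(src_dev));
    CHECK(cudaMalloc(&src_buf, nbytes));

    CHECK(cudaSetDevice(dst_dev));
    CHECK(cudaMalloc(&dst_buf, nbytes));
    if (can_access)
        CHECK(cudaDeviceEnablePeerAccess(src_dev, 0));

    /* With peer access enabled, cudaMemcpyPeer issues a direct GPU-to-GPU
     * DMA; otherwise the runtime falls back to staging through host memory. */
    CHECK(cudaMemcpyPeer(dst_buf, dst_dev, src_buf, src_dev, nbytes));
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(dst_buf));
    CHECK(cudaSetDevice(src_dev));
    CHECK(cudaFree(src_buf));
    return 0;
}

The fallback path through host memory is exactly the staging overhead that a DMA-assisted, peer-to-peer design avoids.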

Cited by 11 publications (11 citation statements).
References 14 publications (16 reference statements).
“…This way, there is no need to stage the GPU data in and out of the host memory, which can significantly enhance the performance of intranode inter-process GPU-to-GPU communication. Previous research has used CUDA IPC to optimize point-to-point and one-sided communications in MPI [12,5]. However, to the best of our knowledge, CUDA IPC has not been used in the design of collective operations.…”
Section: Introduction
Mentioning confidence: 99%
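
To make the CUDA IPC mechanism mentioned in the quote concrete, the sketch below shows one way two MPI ranks on the same node can share a device buffer without staging through host memory. It assumes exactly two ranks (e.g., mpiexec -n 2) and omits error handling; it is not the code of the cited papers.

/* Sketch: rank 0 exports a device buffer via CUDA IPC, rank 1 maps it and
 * copies it directly GPU-to-GPU, with no host bounce buffer.
 * Assumes exactly two ranks running on the same node. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t nbytes = 1 << 20;

    if (rank == 0) {
        void *d_src;
        cudaMalloc(&d_src, nbytes);

        /* Export an opaque handle describing the device allocation ... */
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_src);

        /* ... and ship the small, host-resident handle to the peer rank. */
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);  /* keep d_src alive until rank 1 is done */
        cudaFree(d_src);
    } else {
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        /* Map rank 0's device buffer into this process's address space. */
        void *d_remote;
        cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);

        void *d_dst;
        cudaMalloc(&d_dst, nbytes);
        /* Direct device-to-device copy; the GPU's DMA engine moves the data. */
        cudaMemcpy(d_dst, d_remote, nbytes, cudaMemcpyDeviceToDevice);

        cudaIpcCloseMemHandle(d_remote);
        cudaFree(d_dst);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
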
“…It has been shown that intranode and internode communications between GPUs in HPC platforms play an important role in the performance of scientific applications [1,10]. In this regard, researchers have started looking into incorporating GPU-awareness into the MPI library, targeting both point-to-point and collective communications [12,5,14,11].…”
Section: Introduction
Mentioning confidence: 99%
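
From the application's point of view, the GPU-awareness discussed above means device pointers can be handed directly to MPI communication calls. A minimal sketch, assuming a CUDA-aware MPI build and exactly two ranks:

/* Sketch of application-level use of a GPU-aware MPI library: device
 * pointers are passed straight to MPI, and the library picks the transfer
 * path (IPC/peer DMA within a node, pipelined staging across nodes).
 * Assumes an MPI build with CUDA support and exactly two ranks. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                 /* number of doubles */
    double *d_buf;
    cudaMalloc((void **)&d_buf, count * sizeof(double));

    if (rank == 0) {
        /* The GPU buffer goes to MPI directly: no explicit cudaMemcpy to a
         * host bounce buffer appears in application code. */
        MPI_Send(d_buf, count, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    } else {
        MPI_Recv(d_buf, count, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
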
“…In addition to enhanced programmability, transparent architecture-specific and vendor-specific performance optimizations can be provided within the MPI layer. For example, MPI-ACC enables automatic data pipelining for internode communication, NUMA affinity management, and direct GPU-to-GPU data movement (GPUDirect) for all applicable intranode CUDA communications [6,19], thus providing a heavily optimized end-to-end communication platform.…”
Section: Application Design Using GPU-integrated MPI Framework
Mentioning confidence: 99%
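
The automatic data pipelining that the quote attributes to MPI-ACC can be approximated by hand. The sketch below shows the sender side of a chunked, double-buffered staging pipeline; the helper name, chunk size, and buffering scheme are illustrative assumptions, not MPI-ACC internals.

/* Sketch of the sender side of a manually pipelined GPU-to-remote transfer:
 * the device buffer is staged to pinned host memory in chunks, and the MPI
 * send of chunk i overlaps the device-to-host copy of chunk i+1.  A
 * GPU-integrated MPI performs this pipelining internally and transparently. */
#include <mpi.h>
#include <cuda_runtime.h>

#define CHUNK_BYTES (1 << 20)   /* 1 MiB pipeline stage (illustrative) */

void send_gpu_buffer_pipelined(const char *d_buf, size_t nbytes,
                               int dest, int tag, MPI_Comm comm)
{
    char *h_stage[2];
    cudaStream_t stream[2];
    MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

    for (int i = 0; i < 2; i++) {
        cudaMallocHost((void **)&h_stage[i], CHUNK_BYTES);  /* pinned staging */
        cudaStreamCreate(&stream[i]);
    }

    size_t offset = 0;
    int slot = 0;
    while (offset < nbytes) {
        size_t len = nbytes - offset < CHUNK_BYTES ? nbytes - offset
                                                   : CHUNK_BYTES;

        /* Make sure the previous send that used this staging slot is done. */
        MPI_Wait(&req[slot], MPI_STATUS_IGNORE);

        /* Stage the next chunk to the host ... */
        cudaMemcpyAsync(h_stage[slot], d_buf + offset, len,
                        cudaMemcpyDeviceToHost, stream[slot]);
        cudaStreamSynchronize(stream[slot]);

        /* ... and send it; the other slot can be staged while this send is
         * still in flight, which is where the overlap comes from. */
        MPI_Isend(h_stage[slot], (int)len, MPI_BYTE, dest, tag, comm,
                  &req[slot]);

        offset += len;
        slot ^= 1;
    }

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    for (int i = 0; i < 2; i++) {
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(h_stage[i]);
    }
}

A matching receiver would post one receive per chunk and copy each staged chunk back to the device, overlapping in the same way.
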
“…All-to-all communication [27] and noncontiguous datatype communication [17,29] have also been studied in the context of GPU-aware MPI. With a focus on intranode communication, our previous work [18,19] extends transparent GPU buffers support for MPICH [1] and optimizes the cross-PCIe data movement by using shared memory data structures and interprocess communication (IPC) mechanisms. In contrast to those efforts, here we study the synergistic effect between GPU-accelerated MPI applications and a GPU-integrated MPI implementation.…”
Section: Related Work
Mentioning confidence: 99%
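
The shared-memory path mentioned in the quote can be sketched as follows: one process stages its device buffer into a shared host segment and a peer process copies it onto its own GPU. The fork()-based setup, semaphore handshake, and fixed sizes here are simplifications for illustration; separate MPI processes would instead attach to a named segment (e.g., via shm_open).

/* Sketch of host shared-memory staging for intranode GPU-to-GPU data
 * movement: sender copies device -> shared host buffer, receiver copies
 * shared host buffer -> its own device.  This is the baseline path that
 * the IPC/peer-DMA designs above improve on.  Illustrative only. */
#include <cuda_runtime.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NBYTES (1u << 20)

struct shm_region {
    sem_t ready;            /* posted once the payload has been staged */
    char  payload[NBYTES];  /* host bounce buffer shared by both processes */
};

int main(void)
{
    /* Anonymous shared mapping, inherited across fork(). */
    struct shm_region *shm = mmap(NULL, sizeof(*shm),
                                  PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    sem_init(&shm->ready, /*pshared=*/1, 0);

    if (fork() == 0) {                    /* child acts as the receiver */
        void *d_dst;
        cudaSetDevice(1);                 /* assumes a second GPU exists */
        cudaMalloc(&d_dst, NBYTES);
        sem_wait(&shm->ready);            /* wait for the staged payload */
        cudaMemcpy(d_dst, shm->payload, NBYTES, cudaMemcpyHostToDevice);
        cudaFree(d_dst);
        _exit(0);
    }

    /* parent acts as the sender */
    void *d_src;
    cudaSetDevice(0);
    cudaMalloc(&d_src, NBYTES);
    cudaMemcpy(shm->payload, d_src, NBYTES, cudaMemcpyDeviceToHost);
    sem_post(&shm->ready);

    wait(NULL);
    cudaFree(d_src);
    munmap(shm, sizeof(*shm));
    return 0;
}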