Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing 2013
DOI: 10.1145/2493123.2462915

On the efficacy of GPU-integrated MPI for scientific applications

Abstract: Scientific computing applications are quickly adapting to leverage the massive parallelism of GPUs in large-scale clusters. However, the current hybrid programming models require application developers to explicitly manage the disjoint host and GPU memories, thus reducing both efficiency and productivity. Consequently, GPU-integrated MPI solutions, such as MPI-ACC and MVAPICH2-GPU, have been developed that provide unified programming interfaces and optimized implementations for end-to-end data communication …
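To make the programming-model gap concrete, here is a minimal sketch in the CUDA-aware style that GPU-integrated MPI libraries such as MVAPICH2-GPU support (MPI-ACC exposes comparable functionality through buffer attributes): a device pointer is passed directly to a standard MPI call, and the library performs the host staging and pipelining internally. The message size and ranks are illustrative assumptions, not taken from the paper.

/* Hedged sketch: send a GPU buffer with GPU-integrated (CUDA-aware)
 * MPI. Without such support, the application would first cudaMemcpy
 * the data to a host buffer and send that instead. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;              /* illustrative message size */
    double *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    if (rank == 0) {
        /* The device pointer goes straight into MPI_Send; the MPI
         * library detects GPU memory and stages it itself. */
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}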

Cited by 20 publications (10 citation statements).
References 25 publications (41 reference statements).
“…Moreover, there are two GPU computation modes depending on how the visit messages are processed on the GPUs. In this paper, we discuss the exclusive GPU computation mode, but discussion of the cooperative CPU-GPU computation mode can be found in our prior work [27].…”
Section: Computation-Communication Patterns and MPI-ACC-Driven Optimizations
confidence: 99%
“…We compare the combined performance of all the phases of GPU-EpiSimdemics (computeVisits and computeInteractions), with and without the MPI-ACC-driven optimizations discussed in Section 5. Analysis of CA is described in our prior work [27]. We also vary the number of compute nodes from 8 to 128 and the number of GPU devices between 1 and 2.…”
Section: Case Study Analysis: EpiSimdemics
confidence: 99%
“…GeMTC could benefit from a Grophecy or Singe-like module for creating warp-optimized AppKernels and vice versa. MPI-ACC [41] aims to provide integrated MPI support for accelerators to allow the programmer to easily execute code on a CPU or GPU.…”
Section: Related Work
confidence: 99%
“…State-of-the-art techniques that combine distributed- and shared-memory programming models [80], as well as many PGAS approaches [6,24,47,48], have demonstrated the potential benefits of combining both levels of parallelism [81,82,39,83], including increased communication-computation overlap [84,85], improved memory utilization [86,87], power optimization [88] and effective use of accelerators [89,90,91,92]. The hybrid MPI and thread model, such as MPI and OpenMP, can take advantage of those optimized shared-memory algorithms and data structures.…”
Section: Chapter 4 Habanero-C Runtime Communication System
confidence: 99%
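A hedged sketch of the hybrid MPI-plus-OpenMP model this statement describes: MPI provides the distributed-memory level while OpenMP threads exploit shared memory within a node. The summation kernel is an illustrative placeholder, not drawn from the cited chapter.

/* Hedged sketch: hybrid MPI + OpenMP. Each rank computes a partial
 * sum with OpenMP threads, then the ranks combine results with MPI. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* Request thread support so MPI and OpenMP coexist safely. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;              /* illustrative problem size */
    double local = 0.0;

    /* Shared-memory parallelism within the node. */
    #pragma omp parallel for reduction(+ : local)
    for (int i = 0; i < n; i++)
        local += 1.0 / (double)(rank * n + i + 1);

    /* Distributed-memory combination across nodes. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}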