2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum
DOI: 10.1109/ipdpsw.2013.256
Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming

Abstract: Despite the vast interest in accelerator-based systems, programming large multinode GPU systems is still a complex task, particularly with respect to optimal data movement across the host-GPU PCIe connection and then across the network. To address such issues, GPU-integrated MPI solutions have been developed that fold GPU data movement into existing MPI implementations. Currently available GPU-integrated MPI frameworks differ in aspects related to the buffer synchronization and ordering semantic…

Cited by 4 publications (3 citation statements)
References 14 publications (18 reference statements)
“…With our design, one can simply implicitly denote ordering of MPI and GPU operations by associating GPU events or streams with MPI calls, and the MPI-ACC implementation applies different heuristics to synchronize and make efficient communication progress. We have shown in our prior work [26] that this approach improves productivity and performance, while being compatible with the MPI standard. Moreover, our approach introduces a lightweight runtime attribute check to each MPI operation, but the overhead is much less than with automatic detection, as shown in Figure 2.…”
Section: MPI-ACC's Datatype Attributes Approach
Mentioning confidence: 92%
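The statement above describes associating GPU events or streams with MPI calls so that ordering is expressed implicitly and the runtime synchronizes on the application's behalf. The following is a minimal, self-contained Python mock of that idea; it is an illustrative sketch only, and none of the names (`FakeStream`, `mpi_send`) belong to the actual MPI-ACC API.

```python
# Illustrative mock of implicit MPI/GPU ordering via an attribute:
# the MPI call carries an optional stream, and the runtime drains
# pending GPU work before moving data. All names are hypothetical.

class FakeStream:
    """Stand-in for a CUDA stream: records pending kernel work."""
    def __init__(self):
        self.pending = []

    def launch(self, kernel_name):
        self.pending.append(kernel_name)

    def synchronize(self):
        # Complete all outstanding work on this stream.
        self.pending = []

def mpi_send(buf, stream=None):
    """Mock MPI send. The `if stream` test models the lightweight
    runtime attribute check described in the citation: it is cheap,
    and only calls with a GPU attribute pay the synchronization cost."""
    if stream is not None:
        stream.synchronize()  # ensure the producing kernel finished
    return ("sent", buf)

s = FakeStream()
s.launch("kernel_producing_buf")
status, payload = mpi_send([1, 2, 3], stream=s)
print(status, payload, s.pending)  # sent [1, 2, 3] []
```

The design point this mirrors: the per-call check is a constant-time attribute lookup, whereas automatic detection (e.g., querying every pointer's location) would impose its cost on all operations.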
“…This method requires no modifications to the MPI interface. Also, we have shown previously that while their approach works well for standalone point-to-point communication, programmers have to explicitly synchronize between interleaved and dependent MPI and CUDA operations, thereby requiring significant programmer effort to achieve ideal performance [26]. Moreover, as shown in Figure 2, the penalty for runtime checking can be significant and is incurred by all operations, including those that require no GPU data movement at all.…”
Section: API Design
Mentioning confidence: 99%
“…Aji et al [22] examine GPU integrated MPI frameworks and discuss alternatives for buffer synchronization and ordering semantics. In particular, they discuss using MPI communicator or datatype attributes to pass semantic information to the runtime implementation.…”
Section: B. Architectures and Relaxed Ordering
Mentioning confidence: 99%
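The citation above refers to passing semantic information to the runtime through MPI communicator or datatype attributes, in the spirit of MPI's keyval mechanism (`MPI_Type_create_keyval` / `MPI_Type_set_attr`). A hedged, self-contained Python sketch of that pattern follows; the class and function names are illustrative stand-ins, not the real MPI bindings.

```python
# Hypothetical sketch: attaching semantic hints (buffer location,
# associated GPU stream) to a datatype via a key->value attribute
# table, so the runtime can choose the right data path per call.

class Datatype:
    """Minimal datatype object carrying an attribute table."""
    def __init__(self, name):
        self.name = name
        self.attrs = {}

def type_set_attr(dtype, key, value):
    dtype.attrs[key] = value

def type_get_attr(dtype, key):
    # Returns None when no hint was attached, so host-only
    # datatypes incur no GPU handling at all.
    return dtype.attrs.get(key)

gpu_float = Datatype("float")
type_set_attr(gpu_float, "buffer_location", "gpu")  # semantic hint
type_set_attr(gpu_float, "gpu_stream", 7)           # ordering hint

loc = type_get_attr(gpu_float, "buffer_location")
print(loc)  # gpu
```

Because the hints travel with the datatype rather than with each call, the MPI interface itself needs no new arguments, which is the compatibility property the cited discussion emphasizes.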