Proceedings of the 1st Workshop on Architectures and Systems for Big Data 2011
DOI: 10.1145/2377978.2377981

Extending MPI to accelerators

Abstract: Current trends in computing and system architecture point towards a need for accelerators such as GPUs to have inherent communication capabilities. We review previous and current software libraries that provide pseudo-communication abilities through direct message passing. We show how these libraries are beneficial to the HPC community, but are not forward-thinking enough. We give motivation as to why MPI should be extended to support these accelerators, and provide a road map of achievable milestones to compl…

Cited by 15 publications (8 citation statements, published 2012–2021). References 5 publications.
“…Then, MVAPICH2-GPU internally uses different implementations depending on whether the memory buffer is in the device memory or the host memory. Stuart et al. have discussed various design options for extending MPI to support accelerators [7]. Gelado et al. proposed GMAC, which provides a single memory space shared by a CPU and a GPU and hence allows MPI functions to access device memory data [8].…”

Algorithm 1 (OpenCL clEnqueueReadBuffer signature, as listed in the citing excerpt):

    cl_int clEnqueueReadBuffer(
        cl_command_queue cmd,       /* command queue */
        cl_mem           buf,       /* memory buffer */
        cl_bool          blocking,  /* blocking read flag */
        size_t           offset,    /* offset into the buffer */
        size_t           size,      /* size in bytes to read */
        void            *hbuf,      /* host buffer pointer */
        cl_uint          numevts,   /* number of events in the wait list */
        const cl_event  *wlist,     /* event wait list */
        cl_event        *evtret);   /* returned event object */
Section: Related Work (mentioning)
confidence: 99%
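The MVAPICH2-GPU behaviour described above is exercised with ordinary MPI calls: the application hands a device pointer straight to MPI_Send/MPI_Recv and the library detects whether it refers to host or device memory. Below is a minimal sketch, assuming a CUDA-aware MPI build (such as MVAPICH2-GPU) and two ranks; the buffer size and variable names are illustrative assumptions, not code from the cited papers.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        double *d_buf;                                  /* device-resident buffer */
        cudaMalloc((void **)&d_buf, n * sizeof(double));
        cudaMemset(d_buf, 0, n * sizeof(double));

        if (rank == 0) {
            /* The device pointer goes to MPI directly; a CUDA-aware library
               recognises device memory and stages or pipelines the transfer. */
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

With an MPI library that is not CUDA-aware, the same call would typically fail, because MPI would dereference the device pointer as if it were host memory; distinguishing the two cases internally is exactly what MVAPICH2-GPU provides.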
“…Currently, only processes running on the CPU can perform MPI calls. Stuart et al [15] have suggested several mechanisms for extending the MPI standard to provide native support for accelerators. One significant proposal would allow GPU threads to obtain MPI ranks and participate directly in MPI communication [16].…”
Section: Related Work (mentioning)
confidence: 99%
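For contrast, the situation described by "only processes running on the CPU can perform MPI calls" looks like the sketch below: the GPU kernel produces data, but the host process must wait for it, copy the result back, and issue the MPI call itself. This is ordinary CUDA-plus-MPI practice, not code from [15] or [16]; names and sizes are illustrative.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* The GPU does the compute ... */
    __global__ void scale(double *x, int n, double a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    /* ... but only the host process can talk to MPI. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        double *h_x = (double *)malloc(n * sizeof(double));
        double *d_x;
        cudaMalloc((void **)&d_x, n * sizeof(double));

        if (rank == 0) {
            cudaMemset(d_x, 0, n * sizeof(double));
            scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0);
            /* The kernel cannot send its own result: the host waits for the
               copy (which synchronises with the kernel) and then calls MPI. */
            cudaMemcpy(h_x, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);
            MPI_Send(h_x, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(h_x, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);
        }

        free(h_x);
        cudaFree(d_x);
        MPI_Finalize();
        return 0;
    }

Giving GPU threads their own MPI ranks, as proposed in [15] and [16], would remove the host round trip in the rank 0 branch.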
“…Recently, Stuart et al. proposed several potential directions for extending the MPI standard to provide native support for these accelerators [13]. One significant proposed extension is to allow accelerators to obtain MPI ranks and participate directly in MPI operations.…”
Section: Related Work (mentioning)
confidence: 99%
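As a purely hypothetical illustration of "accelerators obtain MPI ranks and participate directly in MPI operations", the fragment below imagines a device-callable, MPI-like interface. The functions dMPI_Comm_rank and dMPI_Send are invented names for this sketch; they are not defined by the MPI standard, by MVAPICH2-GPU, or by the cited proposal, which these excerpts describe only at the level of design directions.

    /* Hypothetical device-side API: declarations only, no real implementation. */
    __device__ int dMPI_Comm_rank(void);                  /* invented: rank held by the calling GPU context */
    __device__ int dMPI_Send(const void *buf, int count,
                             int dest, int tag);          /* invented: device-initiated send */

    /* One thread per block pushes its halo region to a neighbour rank,
       without returning control to the host. */
    __global__ void halo_push(const double *halo, int n)
    {
        if (threadIdx.x == 0) {
            int me = dMPI_Comm_rank();                    /* this GPU's own MPI rank */
            dMPI_Send(halo, n, me + 1, /* tag = */ 7);    /* push halo to the right neighbour */
        }
    }

Whether such calls would be issued per thread, per block, or per device is among the design questions the cited road map raises rather than settles.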