2016
DOI: 10.1109/tpds.2015.2446479

MPI-ACC: Accelerator-Aware MPI for Scientific Applications

Abstract: Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement standards, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC…
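To make the gap described in the abstract concrete, here is a minimal sketch in C (not the MPI-ACC API itself, which the truncated abstract does not show): a conventional MPI code must first stage GPU data through a host buffer, whereas an accelerator-aware MPI can be handed the device buffer directly and perform the device-to-network movement internally. The buffer names and the fixed message size N are illustrative.

```c
/* Minimal sketch, assuming a CUDA device buffer and an accelerator-aware MPI;
 * this is not the MPI-ACC interface itself. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (1 << 20)   /* elements to send (illustrative) */

/* Standard MPI: the application must stage GPU data through host memory. */
void send_staged(double *d_buf, int dest) {
    double *h_buf = (double *)malloc(N * sizeof(double));
    cudaMemcpy(h_buf, d_buf, N * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    free(h_buf);
}

/* Accelerator-aware MPI: the library is told (or detects) that d_buf lives in
 * GPU memory and moves it to the network internally. */
void send_gpu_aware(double *d_buf, int dest) {
    MPI_Send(d_buf, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}
```

In the second case the library can also pipeline and manage its internal buffers, which is the kind of optimization discussed in the citation excerpts below.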

Cited by 16 publications (6 citation statements)
References 22 publications
“…A discussion on MPI design choices and a comprehensive optimization of the data pipelining and buffer management has been provided by Aji et al [9]. The study investigated the efficiency of applying MPI-ACC to scientific applications, mainly in the field of epidemiology, and outlined the lessons learned and the tradeoffs.…”
Section: Related Work
confidence: 99%
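The data pipelining mentioned in the excerpt above generally means streaming a large GPU buffer through small pinned host chunks so that the device-to-host copy of one chunk overlaps the network transfer of the previous one. The sketch below is a generic illustration of that idea, not Aji et al.'s actual implementation; the chunk size, double-buffering scheme, and function name are assumptions.

```c
/* Generic GPU-to-network pipelining sketch (illustrative, not from [9]). */
#include <mpi.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)   /* elements per pipeline stage (illustrative) */

void pipelined_send(const double *d_buf, size_t n, int dest, MPI_Comm comm) {
    double *h_stage[2];
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMallocHost((void **)&h_stage[0], CHUNK * sizeof(double));
    cudaMallocHost((void **)&h_stage[1], CHUNK * sizeof(double));

    MPI_Request req = MPI_REQUEST_NULL;
    for (size_t off = 0, i = 0; off < n; off += CHUNK, ++i) {
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
        /* Start copying the next chunk into a pinned staging buffer ... */
        cudaMemcpyAsync(h_stage[i % 2], d_buf + off, len * sizeof(double),
                        cudaMemcpyDeviceToHost, stream);
        /* ... while the previous chunk is still in flight on the network. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        cudaStreamSynchronize(stream);
        MPI_Isend(h_stage[i % 2], (int)len, MPI_DOUBLE, dest, 0, comm, &req);
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    cudaFreeHost(h_stage[0]);
    cudaFreeHost(h_stage[1]);
    cudaStreamDestroy(stream);
}
```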
“…Accelerator-based computing: Motivated by the lack of high-level abstractions in heterogeneous parallel programming models, which requires programmers to resort to complex data copying and synchronization schemes, the research community has come up with various proposals for easing programmability and improving performance. Examples include a runtime system and architecture support for simple and efficient data exchange [18] as well as an integrated message passing framework targeting end-to-end data movement among CUDA, OpenCL and CPU memory spaces [19]. An overview of current heterogeneous systems and development frameworks [20] concludes that most works focus on outsourcing compute-intensive tasks entirely to accelerators, leaving the host CPU idle while the accelerators are busy.…”
Section: Related Work
confidence: 99%
“…Our previous work found that the data marshaling phase performs better when it is implemented on the GPU itself rather than on the CPU [28]. To accomplish MPI communication directly from the OpenCL device, we used MPI-ACC [28], a GPU-aware MPI framework, based on the MPICH MPI implementation [29], that performs point-to-point communication among OpenCL devices across the network. Moreover, as a consequence of performing data marshaling on the device, the host-device bulk data transfers before and after each velocity-stress computation kernel are completely avoided.…”
Section: MPI+OpenCL Implementation for Multiple Nodes
confidence: 99%
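For context, the sketch below shows the kind of host-staged halo exchange that the last excerpt says is avoided once marshaling runs on the GPU and point-to-point communication can operate on device buffers directly. It is a hypothetical baseline; the buffer and parameter names (d_wavefield, h_halo, halo_bytes, neighbor) are illustrative and not taken from the cited work.

```c
/* Hypothetical host-staged halo exchange (the pattern the excerpt says is
 * avoided); names and sizes are illustrative. */
#include <mpi.h>
#include <CL/cl.h>

void halo_exchange_staged(cl_command_queue queue, cl_mem d_wavefield,
                          void *h_halo, size_t halo_bytes,
                          int neighbor, MPI_Comm comm) {
    /* Bulk device-to-host copy before the communication step ... */
    clEnqueueReadBuffer(queue, d_wavefield, CL_TRUE, 0, halo_bytes,
                        h_halo, 0, NULL, NULL);
    MPI_Sendrecv_replace(h_halo, (int)halo_bytes, MPI_BYTE,
                         neighbor, 0, neighbor, 0, comm, MPI_STATUS_IGNORE);
    /* ... and a bulk host-to-device copy afterwards. With GPU-side marshaling
     * and device-aware point-to-point calls, both copies are eliminated. */
    clEnqueueWriteBuffer(queue, d_wavefield, CL_TRUE, 0, halo_bytes,
                         h_halo, 0, NULL, NULL);
}
```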