2016
DOI: 10.1109/tpds.2015.2446479

MPI-ACC: Accelerator-Aware MPI for Scientific Applications

Abstract: Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement standards, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC…
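To make the gap described in the abstract concrete, here is a minimal sketch in C (not the MPI-ACC API itself, which the truncated abstract does not show): a conventional MPI code must first stage GPU data through a host buffer, whereas an accelerator-aware MPI can be handed the device buffer directly and perform the device-to-network movement internally. The buffer names and the fixed message size N are illustrative.

```c
/* Minimal sketch, assuming a CUDA device buffer and an accelerator-aware MPI;
 * this is not the MPI-ACC interface itself. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (1 << 20)   /* elements to send (illustrative) */

/* Standard MPI: the application must stage GPU data through host memory. */
void send_staged(double *d_buf, int dest) {
    double *h_buf = (double *)malloc(N * sizeof(double));
    cudaMemcpy(h_buf, d_buf, N * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    free(h_buf);
}

/* Accelerator-aware MPI: the library is told (or detects) that d_buf lives in
 * GPU memory and moves it to the network internally. */
void send_gpu_aware(double *d_buf, int dest) {
    MPI_Send(d_buf, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}
```

In the second case the library can also pipeline and manage its internal buffers, which is the kind of optimization discussed in the citation excerpts below.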

Cited by 16 publications (6 citation statements)
References 22 publications
“…A discussion on MPI design choices and a comprehensive optimization of the data pipelining and buffer management has been provided by Aji et al [9]. The study investigated the efficiency of applying MPI-ACC to scientific applications, mainly in the field of epidemiology, and outlined the lessons learned and the tradeoffs.…”
Section: Related Work
confidence: 99%
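The data pipelining mentioned in the excerpt above generally means streaming a large GPU buffer through small pinned host chunks so that the device-to-host copy of one chunk overlaps the network transfer of the previous one. The sketch below is a generic illustration of that idea, not Aji et al.'s actual implementation; the chunk size, double-buffering scheme, and function name are assumptions.

```c
/* Generic GPU-to-network pipelining sketch (illustrative, not from [9]). */
#include <mpi.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)   /* elements per pipeline stage (illustrative) */

void pipelined_send(const double *d_buf, size_t n, int dest, MPI_Comm comm) {
    double *h_stage[2];
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMallocHost((void **)&h_stage[0], CHUNK * sizeof(double));
    cudaMallocHost((void **)&h_stage[1], CHUNK * sizeof(double));

    MPI_Request req = MPI_REQUEST_NULL;
    for (size_t off = 0, i = 0; off < n; off += CHUNK, ++i) {
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
        /* Start copying the next chunk into a pinned staging buffer ... */
        cudaMemcpyAsync(h_stage[i % 2], d_buf + off, len * sizeof(double),
                        cudaMemcpyDeviceToHost, stream);
        /* ... while the previous chunk is still in flight on the network. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        cudaStreamSynchronize(stream);
        MPI_Isend(h_stage[i % 2], (int)len, MPI_DOUBLE, dest, 0, comm, &req);
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    cudaFreeHost(h_stage[0]);
    cudaFreeHost(h_stage[1]);
    cudaStreamDestroy(stream);
}
```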
“…Accelerator-based computing: Motivated by the lack of high-level abstractions in heterogeneous parallel programming models, which requires programmers to resort to complex data copying and synchronization schemes, the research community has come up with various proposals for easing programmability and improving performance. Examples include a runtime system and architecture support for simple and efficient data exchange [18] as well as an integrated message passing framework targeting end-to-end data movement among CUDA, OpenCL and CPU memory spaces [19]. An overview of current heterogeneous systems and development frameworks [20] concludes that most works focus on outsourcing compute-intensive tasks entirely to accelerators, leaving the host CPU idle while the accelerators are busy.…”
Section: Related Work
confidence: 99%
“…Our previous work found that the data marshaling phase performs better when it is implemented on the GPU itself rather than on the CPU [28]. To accomplish MPI communication directly from the OpenCL device, we used MPI-ACC [28], a GPU-aware MPI framework, based on the MPICH MPI implementation [29], that performs point-to-point communication among OpenCL devices across the network. Moreover, as a consequence of performing data marshaling on the device, the host-device bulk data transfers before and after each velocity-stress computation kernel are completely avoided.…”
Section: MPI+OpenCL Implementation for Multiple Nodes
confidence: 99%
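For context, the sketch below shows the kind of host-staged halo exchange that the last excerpt says is avoided once marshaling runs on the GPU and point-to-point communication can operate on device buffers directly. It is a hypothetical baseline; the buffer and parameter names (d_wavefield, h_halo, halo_bytes, neighbor) are illustrative and not taken from the cited work.

```c
/* Hypothetical host-staged halo exchange (the pattern the excerpt says is
 * avoided); names and sizes are illustrative. */
#include <mpi.h>
#include <CL/cl.h>

void halo_exchange_staged(cl_command_queue queue, cl_mem d_wavefield,
                          void *h_halo, size_t halo_bytes,
                          int neighbor, MPI_Comm comm) {
    /* Bulk device-to-host copy before the communication step ... */
    clEnqueueReadBuffer(queue, d_wavefield, CL_TRUE, 0, halo_bytes,
                        h_halo, 0, NULL, NULL);
    MPI_Sendrecv_replace(h_halo, (int)halo_bytes, MPI_BYTE,
                         neighbor, 0, neighbor, 0, comm, MPI_STATUS_IGNORE);
    /* ... and a bulk host-to-device copy afterwards. With GPU-side marshaling
     * and device-aware point-to-point calls, both copies are eliminated. */
    clEnqueueWriteBuffer(queue, d_wavefield, CL_TRUE, 0, halo_bytes,
                         h_halo, 0, NULL, NULL);
}
```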