Feng Ji scite author profile

DMA-Assisted, Intranode Communication in GPU Accelerated Systems

Accelerator awareness has become a pressing issue in data movement models such as MPI because of the rapid deployment of systems that use accelerators. In our previous work, we developed techniques to enhance MPI with accelerator awareness, allowing applications to easily and efficiently communicate data between accelerator memories. In this paper, we extend that work with techniques for efficient data movement between accelerators within the same node, using a DMA-assisted, peer-to-peer intranode communication technique that was recently introduced for NVIDIA GPUs. We present a detailed design of our new approach to intranode communication and evaluate how it improves communication and application performance using micro-kernel benchmarks and a 2D stencil application kernel.

On the efficacy of GPU-integrated MPI for scientific applications

Aji

Panwar

et al. 2013

Scientific computing applications are quickly adapting to leverage the massive parallelism of GPUs in large-scale clusters. However, current hybrid programming models require application developers to explicitly manage disjoint host and GPU memories, reducing both efficiency and productivity. Consequently, GPU-integrated MPI solutions such as MPI-ACC and MVAPICH2-GPU have been developed to provide unified programming interfaces and optimized implementations for end-to-end data communication among CPUs and GPUs. To date, however, there has been no in-depth characterization of the new optimization spaces these systems open up, or of their productivity impact on scientific applications. In this paper, we study the efficacy of GPU-integrated MPI on scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We use MPI-ACC as an example implementation and demonstrate how the programmer can seamlessly choose either the CPU or the GPU as the logical communication end point, depending on the application's computational requirements. MPI-ACC also encourages programmers to explore novel application-specific optimizations, such as internode CPU-GPU communication with concurrent CPU-GPU computations, which can improve overall cluster utilization. Furthermore, MPI-ACC internally implements scalable memory management techniques, decoupling low-level memory optimizations from the applications and making them scalable and portable across several architectures. Experimental results from a state-of-the-art cluster with hundreds of GPUs show that the new application-specific optimizations enabled by MPI-ACC can improve the performance of an epidemiology simulation by up to 61.6% and the performance of a seismology modeling application by up to 44%, compared with traditional hybrid MPI+GPU implementations.
We conclude that GPU-integrated MPI significantly enhances programmer productivity ...

A Sequential Response Model for Analyzing Process Data on Technology-Based Problem-Solving Tasks

Han

Liu

2021

Multivariate Behavioral Research

Valve saturable reactor iron losses of ultrahigh‐voltage direct current converter

Cao

Ji

Liu

et al. 2016

IET Science, Measurement & Technology

Psychometric properties and gender invariance of the simplified Chinese version of Night Eating Questionnaire in a large sample of mainland Chinese college students

Zhang

et al. 2018

Eat Weight Disord

Temperamental Shyness and Anger/Frustration in Childhood: Normative Development, Individual Differences, and the Impacts of Maternal Intrusiveness and Frontal Electroencephalogram Asymmetry

Liu

Phillips

et al. 2021

Child Development

This study used latent growth curve modeling to identify normative development and individual differences in the developmental patterns of shyness and anger/frustration across childhood. It also examined how maternal intrusiveness and frontal electroencephalogram (EEG) asymmetry at age 4 affect these developmental patterns. A total of 180 children (92 boys, 88 girls; mean age 4.07 years at baseline; 75.6% White, 18.3% Black, 6.1% multiracial/other) participated in the study. Normative development included significant linear decreases in both shyness and anger/frustration, and individual differences existed in these developmental patterns. Children with left frontal EEG asymmetry showed a faster decrease in shyness. Children who experienced higher maternal intrusiveness and had left frontal EEG asymmetry showed a slower decrease in anger/frustration.

MPI-ACC: Accelerator-Aware MPI for Scientific Applications

Aji

Panwar

Ji

et al. 2016

IEEE Trans. Parallel Distrib. Syst.

Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into these data movement standards, leaving applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. It supports data transfer among CUDA, OpenCL, and CPU memory spaces and is extensible to other offload models as well. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers, scalable memory management techniques, and balancing of communication based on accelerator and node architecture. MPI-ACC is designed to run concurrently with other GPU workloads with minimal contention. We describe how MPI-ACC can be used to design new communication-computation patterns in scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We present experimental results on a state-of-the-art cluster with hundreds of GPUs, and we compare the performance and productivity of MPI-ACC with MVAPICH, a popular CUDA-aware MPI solution. MPI-ACC encourages programmers to explore novel application-specific optimizations for improved overall cluster utilization.
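
The abstract lists pipelining of data transfers among MPI-ACC's runtime optimizations. The sketch below is a simplified timing model of that general idea, not MPI-ACC's actual implementation. It assumes a message split into equal chunks, a GPU-to-host copy stage and a network-send stage that each handle one chunk at a time, and fixed per-chunk costs. The function names `unpipelined_time` and `pipelined_time` are hypothetical.

```python
"""Simplified timing model of a two-stage pipelined GPU-to-network transfer.

Assumptions (illustrative, not taken from MPI-ACC): the message is split into
equal chunks; stage 1 copies a chunk from GPU to host memory, stage 2 sends it
over the network; each stage handles one chunk at a time; per-chunk costs are fixed.
"""


def unpipelined_time(n_chunks: int, copy_per_chunk: float, send_per_chunk: float) -> float:
    """Copy the whole message to the host first, then send all of it."""
    return n_chunks * (copy_per_chunk + send_per_chunk)


def pipelined_time(n_chunks: int, copy_per_chunk: float, send_per_chunk: float) -> float:
    """Overlap the host copy of chunk i+1 with the network send of chunk i."""
    copy_done = 0.0  # time at which the copy stage finishes the current chunk
    send_done = 0.0  # time at which the send stage finishes the current chunk
    for _ in range(n_chunks):
        # The copy stage processes chunks back to back.
        copy_done += copy_per_chunk
        # A chunk can be sent only after it is copied and the previous send is done.
        send_done = max(send_done, copy_done) + send_per_chunk
    return send_done


if __name__ == "__main__":
    # 8 chunks, 1.0 time unit to copy and 1.5 to send each chunk.
    print(unpipelined_time(8, 1.0, 1.5))  # 20.0
    print(pipelined_time(8, 1.0, 1.5))    # 1.0 + 8 * 1.5 = 13.0
```

Under these assumptions the pipelined time equals t_copy + (n - 1) * max(t_copy, t_send) + t_send, so with many chunks the slower stage dominates. Real transfers also pay a fixed cost per chunk, which is why chunk size is a tuning trade-off rather than a case of "smaller is always better".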

Feng Ji

Using Shared Memory to Accelerate MapReduce on Graphics Processing Units

DMA-Assisted, Intranode Communication in GPU Accelerated Systems

On the efficacy of GPU-integrated MPI for scientific applications

A Sequential Response Model for Analyzing Process Data on Technology-Based Problem-Solving Tasks

Valve saturable reactor iron losses of ultrahigh‐voltage direct current converter

Psychometric properties and gender invariance of the simplified Chinese version of Night Eating Questionnaire in a large sample of mainland Chinese college students

Temperamental Shyness and Anger/Frustration in Childhood: Normative Development, Individual Differences, and the Impacts of Maternal Intrusiveness and Frontal Electroencephalogram Asymmetry

MPI-ACC: Accelerator-Aware MPI for Scientific Applications