Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware 2015
DOI: 10.1145/2832241.2832247
Hyper-Q aware intranode MPI collectives on the GPU

Cited by 4 publications (4 citation statements); references 4 publications.
“…The work in this paper extends our prior study 12,13 in different ways. While our collective designs in our other work 12 target a single node with a single GPU, in this paper, we extend our work and propose a three-level hierarchical framework for GPU collectives for clusters with multi-GPU nodes.…”
supporting
confidence: 63%
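The three-level hierarchy described in this citation (within a GPU, across GPUs sharing a node, and across nodes) can be pictured with a minimal sketch. The code below is only an illustration under the assumptions of one MPI rank per GPU and a CUDA-aware MPI library; the function and variable names are hypothetical, and this is not the cited framework's implementation.

```cuda
// Hypothetical sketch of a three-level hierarchical reduce (not the cited design):
//   level 1: reduce the data resident on each GPU (assumed done already),
//   level 2: combine partial results across the GPUs/ranks sharing a node,
//   level 3: combine node leaders across the cluster.
// Assumes one MPI rank per GPU and a CUDA-aware MPI; compile with nvcc + MPI wrappers.
#include <mpi.h>
#include <cuda_runtime.h>

void hierarchical_reduce(const float* d_local, float* d_result, int count)
{
    MPI_Comm node_comm, leader_comm;
    int world_rank, node_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Level-2 communicator: ranks on the same physical node (MPI-3).
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    // Level-3 communicator: one leader rank per node.
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    // Level 2: intra-node reduce of the per-GPU partial results (device buffers,
    // handled by the CUDA-aware MPI library).
    float* d_node = nullptr;
    cudaMalloc(&d_node, count * sizeof(float));
    MPI_Reduce(d_local, d_node, count, MPI_FLOAT, MPI_SUM, 0, node_comm);

    // Level 3: node leaders reduce across the cluster.
    if (node_rank == 0)
        MPI_Reduce(d_node, d_result, count, MPI_FLOAT, MPI_SUM, 0, leader_comm);

    cudaFree(d_node);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```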
“…We evaluate different combinations of our algorithms in the proposed framework and discuss our findings. In addition, this paper extends the proposed algorithms in our other work from a single GPU to across the clusters and provides an extended evaluation of using different copy types for collective operations against a wider set of alternative designs. Our experimental results highlight the importance of efficiently using the right copy type in GPU collective operations; this observation is further investigated and discussed in this paper by providing some profiling results.…”
Section: Introduction
mentioning
confidence: 92%
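As a rough illustration of what "copy type" means in this context, the sketch below contrasts a direct peer-to-peer device copy with a copy staged through pinned host memory. It is a generic example, not one of the evaluated designs; the function name and device numbering are assumptions.

```cuda
// Illustrative only (not the paper's code): two copy types a GPU collective
// might use to move a buffer from GPU 0 to GPU 1.
#include <cuda_runtime.h>

void copy_between_gpus(float* d_dst_on_gpu1, const float* d_src_on_gpu0, size_t n)
{
    size_t bytes = n * sizeof(float);
    cudaStream_t stream;
    cudaSetDevice(0);
    cudaStreamCreate(&stream);

    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (can_access) {
        // Copy type 1: direct device-to-device peer copy; the data stays on the
        // device path (NVLink/PCIe peer access), with no host staging.
        cudaMemcpyPeerAsync(d_dst_on_gpu1, 1, d_src_on_gpu0, 0, bytes, stream);
    } else {
        // Copy type 2: stage through pinned host memory, i.e. two transfers.
        float* h_staging = nullptr;
        cudaMallocHost(&h_staging, bytes);
        cudaMemcpyAsync(h_staging, d_src_on_gpu0, bytes,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        cudaSetDevice(1);
        cudaMemcpy(d_dst_on_gpu1, h_staging, bytes, cudaMemcpyHostToDevice);
        cudaFreeHost(h_staging);
        cudaSetDevice(0);
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```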
“…Recent work [56,44,55,31] leverage CUDA IPC in order to improve various intra-node and inter-node MPI collectives of a single process/application, and thus facilitate the porting to, and improve the performance of HPC applications on GPUs. MVAPICH2 [53], for instance, supports the use of MPI calls directly over GPU memory.…”
Section: Related Work
mentioning
confidence: 99%
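For context on the CUDA IPC mechanism mentioned in this citation, the sketch below shows the usual export/import pattern: one process obtains a cudaIpcMemHandle_t for a device allocation and a peer process on the same node maps it with cudaIpcOpenMemHandle, so data can be copied device-to-device without staging through host memory. The helper names are hypothetical, and the host-side exchange of the handle (for example over MPI or a pipe) is omitted.

```cuda
// Sketch of the CUDA IPC pattern referenced above (illustrative only).
#include <cuda_runtime.h>

// Process A: allocate a device buffer and export an IPC handle for it.
// The handle must be sent to the peer process via some host-side channel
// (e.g., MPI, a pipe, or shared memory) -- omitted here.
cudaIpcMemHandle_t export_buffer(float** d_buf, size_t bytes)
{
    cudaIpcMemHandle_t handle;
    cudaMalloc(d_buf, bytes);
    cudaIpcGetMemHandle(&handle, *d_buf);
    return handle;
}

// Process B: map the peer's buffer into this process and copy from it
// directly on the device.
void import_and_copy(cudaIpcMemHandle_t handle, float* d_dst, size_t bytes)
{
    void* d_peer = nullptr;
    cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_dst, d_peer, bytes, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(d_peer);
}
```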
“…The result of calculating Euclidean distance is continued in the second kernel using Parallel Reduce Interleaved Address method. This method is selected because it can complete the summation in the array [16]. In this second kernel the number of threads used is the same as the number of data features.…”
Section: Finding BMU
mentioning
confidence: 99%
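The "Parallel Reduce Interleaved Address" method referred to here is the classic interleaved-addressing shared-memory reduction. The kernel below is a generic sketch of that pattern (assuming a power-of-two block size), not the cited paper's kernel; names and launch parameters are illustrative.

```cuda
// Classic interleaved-addressing parallel reduction (sum). Each block reduces
// blockDim.x elements in shared memory; thread 0 of each block writes one
// partial sum, which can be reduced further on the host or in a second launch.
__global__ void reduce_interleaved(const float* in, float* block_sums, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread (0 for out-of-range threads).
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Interleaved addressing: at step s, threads whose index is a multiple of
    // 2*s add the element s positions away, halving the active threads each step.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        block_sums[blockIdx.x] = sdata[0];
}

// Example launch: one partial sum per block of 256 threads.
// reduce_interleaved<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);
```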