SC20: International Conference for High Performance Computing, Networking, Storage and Analysis 2020
DOI: 10.1109/sc41405.2020.00049

GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training

Cited by 28 publications (13 citation statements)
References 12 publications
“…Machine learning (ML) applications are playing an increasingly important role in modern high-performance computing (HPC) systems. Besides the optimization of gigantic neural network model training [3,8,17,25,29], the use of HPC plus artificial intelligence (AI) to solve scientific problems is gaining momentum [5,20,22,23]. One example is machine-learning molecular dynamics (MLMD), which aims to bridge the gap between first-principles accuracy and Newtonian MD efficiency [9,37].…”
Section: Introduction
confidence: 99%
“…It is a synchronous weight update technique that schedules backward passes of each micro-batch as early as possible to release the memory occupied by activations. Gems [Jain et al, 2020a] and Chimera [Li and Hoefler, 2021] implement bidirectional pipelines, where each GPU serves two pipeline stages (stages i and P − i, where P is the number of stages). The design of Gems is mostly concerned with activation memory: the forward pass of the next micro-batch starts only after the first backward stage of the previous micro-batch has been computed and its activation memory has been released.…”
Section: Offloading of Weights
confidence: 99%
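The snippet above describes two mechanisms: the bidirectional stage-to-GPU mapping shared by Gems and Chimera, and Gems' memory-aware rule for admitting the next micro-batch's forward pass. The Python sketch below illustrates both under stated assumptions (0-based stage indices, invented function names); it is not code from either paper.

```python
# Minimal sketch (not the authors' code) of two ideas quoted above:
# (1) the bidirectional stage-to-GPU mapping, where each GPU serves two
#     pipeline stages, and
# (2) the Gems-style memory-aware rule that the forward pass of the next
#     micro-batch may start only after the first backward stage of the
#     previous micro-batch has run and released its activation memory.
# Function names and the 0-based stage convention are assumptions.

def bidirectional_stage_map(num_stages: int) -> dict[int, tuple[int, int]]:
    """Map GPU g to its two stages: stage g of the 'down' pipeline and
    stage num_stages - 1 - g of the 'up' pipeline."""
    return {g: (g, num_stages - 1 - g) for g in range(num_stages)}

def can_start_forward(microbatch: int,
                      finished_backward_stages: dict[int, set[int]],
                      num_stages: int) -> bool:
    """Admission rule quoted above: micro-batch m may begin its forward pass
    once the first backward stage (the last pipeline stage, num_stages - 1)
    of micro-batch m - 1 has completed and freed its activations."""
    if microbatch == 0:
        return True
    return (num_stages - 1) in finished_backward_stages.get(microbatch - 1, set())

if __name__ == "__main__":
    print(bidirectional_stage_map(4))       # {0: (0, 3), 1: (1, 2), 2: (2, 1), 3: (3, 0)}
    done = {0: {3}}                          # micro-batch 0 finished backward on stage 3
    print(can_start_forward(1, done, num_stages=4))  # True: stage-3 activations released
```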
“…Het-Pipe [Park et al, 2020] addresses the additional problem of heterogeneous GPUs by grouping them into virtual workers and running pipeline parallelism within each virtual worker, while relying on data parallelism between workers. Varuna [Athlur et al, 2021] […] activation recomputations and respective backward passes are scheduled opportunistically.…”

The statement also summarizes related systems (system | parallelism | pipeline schedule | partitioning strategy):
HetPipe [Park et al, 2020] | DP, PP | Parameter Server | LinProg for PP
Pipe-torch [Zhan and Zhang, 2019] | DP, PP | Async Update | DynProg for DP, PP, GPU allocation
Varuna [Athlur et al, 2021] | DP, PP | Opportunistic Backward Scheduling | Heuristic PP partition, Bruteforce for DP, PP depth
Gems [Jain et al, 2020a] | DP, PP | Bidirectional Pipeline | -
Chimera [Li and Hoefler, 2021] | DP, PP | 1F1B, Bidirectional Pipeline | Greedy mini-batch size, Bruteforce for DP, PP depth

Section: Several Papers Specifically Target Challenging Topologies
confidence: 99%
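As a rough illustration of the HetPipe layout quoted above (pipeline parallelism within each virtual worker, data parallelism across them), the sketch below groups a heterogeneous set of GPUs into virtual workers. The class and function names and the round-robin grouping heuristic are assumptions for illustration, not HetPipe's actual placement algorithm.

```python
# Minimal sketch (not from the cited papers): heterogeneous GPUs are grouped
# into "virtual workers"; pipeline parallelism runs inside each virtual worker,
# and data parallelism (gradient sync) runs across virtual workers.
from dataclasses import dataclass

@dataclass
class GPU:
    gpu_id: int
    model: str  # e.g. "V100", "P100" -- heterogeneous device types

def group_into_virtual_workers(gpus: list[GPU], gpus_per_worker: int) -> list[list[GPU]]:
    """Sort by device type, then round-robin across workers so each virtual
    worker (one pipeline) receives a mix of GPU models."""
    by_model = sorted(gpus, key=lambda g: g.model)
    num_workers = len(gpus) // gpus_per_worker
    workers: list[list[GPU]] = [[] for _ in range(num_workers)]
    for i, gpu in enumerate(by_model):
        workers[i % num_workers].append(gpu)
    return workers

if __name__ == "__main__":
    cluster = [GPU(i, "V100") for i in range(4)] + [GPU(4 + i, "P100") for i in range(4)]
    virtual_workers = group_into_virtual_workers(cluster, gpus_per_worker=4)
    # Each inner list is one pipeline; data parallelism runs between the lists.
    for w, gpus in enumerate(virtual_workers):
        print(f"virtual worker {w}: pipeline stages on GPUs {[g.gpu_id for g in gpus]}")
```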
“…The GPU-Enabled Memory-Aware Model-Parallelism System (GEMS) has been proposed to train large-scale deep learning models on high-resolution images, which are mainly used in digital pathology [23]. Their paper proposes four techniques: GEMS-Basic, GEMS-MAST, GEMS-MASTER, and GEMS-Hybrid.…”
Section: Related Work
confidence: 99%