2019 IEEE 4th International Workshops on Foundations and Applications of Self* Systems (FAS*W)
DOI: 10.1109/fas-w.2019.00050
Efficient Large-Scale Deep Learning Framework for Heterogeneous Multi-GPU Cluster

Cited by 13 publications (7 citation statements)
References 1 publication
“…In [25], a distributed deep learning framework for a heterogeneous multi-GPU cluster combines the advantages of All-reduce and parameter-server methods. In addition, the proposed design performs significant mini-batch training asynchronously to increase the overall utilization of available computing power in the cluster.…”
Section: Related Work
Confidence: 99%
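The excerpt above describes a hybrid design: all-reduce gradient averaging among workers, combined with parameter-server-style updates applied asynchronously so that faster groups in a heterogeneous cluster need not wait for slower ones. A minimal illustrative sketch of that idea (not the paper's actual implementation; all names here are hypothetical) is:

```python
# Illustrative sketch only: combining all-reduce within a worker group
# with asynchronous parameter-server updates across groups.

def all_reduce(grads):
    """Average per-worker gradient vectors within one group."""
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

class ParameterServer:
    """Holds global weights; groups push averaged gradients asynchronously."""
    def __init__(self, weights, lr=0.1):
        self.weights = list(weights)
        self.lr = lr

    def push(self, avg_grad):
        # Applied as soon as a group finishes its mini-batch,
        # without synchronizing with slower (heterogeneous) groups.
        self.weights = [w - self.lr * g for w, g in zip(self.weights, avg_grad)]

    def pull(self):
        return list(self.weights)

ps = ParameterServer([1.0, 2.0])
fast_group = [[0.2, 0.4], [0.4, 0.2]]   # per-worker gradients in the fast group
ps.push(all_reduce(fast_group))          # fast group updates first
slow_group = [[0.1, 0.1]]                # slower group arrives later
ps.push(all_reduce(slow_group))
print(ps.pull())
```

In a real cluster the `push` calls would arrive from concurrent processes; here they are sequential only to keep the sketch runnable.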
“…The distributed deep learning methods can be two-fold: data parallelization and model parallelization. In the case of data parallelization, the deep learning model is replicated across multiple workers [27,28]. Each worker trains a deep learning model with different input data.…”
Section: Distributed Deep Learning
Confidence: 99%
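The data-parallel scheme described above — the model replicated on every worker, each worker training on a different input shard, with gradients averaged before the update — can be sketched as follows (a toy single-parameter example, not code from the cited works):

```python
# Minimal data-parallelism sketch: identical model replicas, distinct
# data shards, gradients averaged (an all-reduce) before each update.

def grad_linear(w, xs, ys):
    """Gradient of mean squared error for y ~ w * x on one data shard."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.0                                   # parameter, replicated on every worker
shards = [([1.0, 2.0], [2.0, 4.0]),      # worker 0's data shard
          ([3.0], [6.0])]                # worker 1's data shard

local_grads = [grad_linear(w, xs, ys) for xs, ys in shards]
avg_grad = sum(local_grads) / len(local_grads)   # all-reduce average
w -= 0.05 * avg_grad                     # identical update on every replica
```

Because every replica applies the same averaged gradient, the copies stay bit-identical after each step, which is what distinguishes data parallelism from model parallelism (where the model itself is partitioned across workers).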
“…To achieve high accuracy, it is necessary to use large deep learning models and large datasets are needed to improve the generalization capabilities of the models. Training large-scale models with massive datasets is difficult due to the limited GPU memory size [1][2][3][4][5][6]. Distributed deep learning using multi-GPU/node can efficiently train large-scale models.…”
Section: Introduction
Confidence: 99%