SC20: International Conference for High Performance Computing, Networking, Storage and Analysis 2020
DOI: 10.1109/sc41405.2020.00023

Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA

Cited by 11 publications (6 citation statements). References 20 publications.
“…However, these single GPU systems are not designed to cope with challenges stemming from pipeline parallelism (§2). A recent study [54] partially offloads the recompute overhead to the CPU processors. This work is complementary to VPIPE and can be integrated into VPIPE to further reduce the recompute overhead.…”
Section: Related Work (mentioning)
confidence: 99%
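The "recompute" referred to in this statement is activation checkpointing: intermediate activations are dropped during the forward pass and recomputed during backward. Below is a minimal PyTorch sketch of that single-GPU baseline; the model, tensor shapes, and the use_reentrant flag are illustrative assumptions, and the CPU offloading proposed by the cited study is not shown.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Illustrative two-stage model; layer sizes and batch shape are arbitrary assumptions.
stage1 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
stage2 = torch.nn.Linear(1024, 10).cuda()

x = torch.randn(32, 1024, device="cuda", requires_grad=True)

# checkpoint() discards stage1's intermediate activations during the forward
# pass and recomputes them on the GPU during backward; this recomputation is
# the overhead that the cited study [54] partially shifts onto the CPU.
h = checkpoint(stage1, x, use_reentrant=False)
loss = stage2(h).sum()
loss.backward()
```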
“…These algorithms move data back and forth between the CPU and the GPU to free up space on the GPU. KARMA [47] is a framework built over PyTorch that extends this out-of-core approach to data parallelism on multiple GPUs. They design an efficient algorithm for automatic offloading and prefetching of activations and parameters of the neural network to and from the CPU DRAM.…”
Section: Data Parallelism (mentioning)
confidence: 99%
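To make the offload-and-prefetch pattern described above concrete, the following is a minimal PyTorch sketch of moving one activation into pinned CPU DRAM after the forward pass and copying it back before it is reused. It illustrates only the general out-of-core idea, not KARMA's interface; the function names are hypothetical, and a real system overlaps these copies with compute and schedules them automatically, which this sketch does not.

```python
import torch

def offload_to_cpu(act: torch.Tensor) -> torch.Tensor:
    """Copy an activation into pinned host memory so the GPU copy can be freed."""
    cpu_buf = torch.empty(act.shape, dtype=act.dtype, pin_memory=True)
    cpu_buf.copy_(act)            # device-to-host copy; pinned memory keeps it fast
    return cpu_buf

def prefetch_to_gpu(cpu_buf: torch.Tensor) -> torch.Tensor:
    """Copy an offloaded activation back to the GPU before it is needed."""
    return cpu_buf.to("cuda", non_blocking=True)

# Illustrative usage around one layer: offload after forward, prefetch before
# the activation is consumed again (e.g. in the backward pass).
act = torch.relu(torch.randn(64, 4096, device="cuda"))
saved = offload_to_cpu(act)
del act                           # releases the GPU copy for later layers
restored = prefetch_to_gpu(saved)
torch.cuda.synchronize()          # ensure the host-to-device copy finished before use
```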
“…These nodes include NVIDIA Tesla A100, Google TPU, and Intel GAUDI accelerators. On such nodes, training efficiency depends on model parallelization (Horovod [24] and KARMA [28]) and on effective communication between accelerators over a specialized network such as NVIDIA NVLink [16]. Despite these innovative designs, increasingly deep and dense network topologies still strain the resources available for training, particularly memory capacity.…”
Section: Related Work (mentioning)
confidence: 99%
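As a concrete example of the data-parallel side of that stack, a minimal Horovod-over-PyTorch setup looks roughly like the following. The model, optimizer, and random data are placeholders; the collective gradient averaging then runs over whatever interconnect the node provides (e.g. NCCL over NVLink).

```python
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())      # pin each process to its local GPU

model = torch.nn.Linear(1024, 10).cuda()     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are allreduced across workers at each step.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Make every worker start from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):                       # toy loop with random data
    x = torch.randn(32, 1024, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```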