2020
DOI: 10.1007/978-3-030-57675-2_10

Optimal GPU-CPU Offloading Strategies for Deep Neural Network Training

Abstract: Training Deep Neural Networks is known to be an expensive operation, both in terms of computational cost and memory load. Indeed, during training, all intermediate layer outputs (called activations) computed during the forward phase must be stored until the corresponding gradient has been computed in the backward phase. These memory requirements sometimes prevent the use of larger batch sizes and deeper networks, and can thus limit both convergence speed and accuracy. Recent works have proposed to offload…
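
To make the memory pressure described in the abstract concrete, here is a back-of-the-envelope estimate of activation storage for a hypothetical convolutional network; every shape, layer count and dtype below is invented for illustration and none of it comes from the paper.

# Rough activation-memory estimate for a made-up network (illustrative only).
batch_size = 64
channels, height, width = 256, 56, 56
num_layers = 50
bytes_per_value = 4  # fp32 activations

# Each layer's output must be stored from the forward pass until its
# gradient has been computed in the backward pass.
per_layer = batch_size * channels * height * width * bytes_per_value
total = per_layer * num_layers
print(f"per layer: {per_layer / 2**20:.0f} MiB, total: {total / 2**30:.1f} GiB")
# -> roughly 196 MiB per layer and about 9.6 GiB in total; doubling the
#    batch size or the depth doubles this footprint.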

Cited by 12 publications (10 citation statements) · References 19 publications

“…It demonstrates a promising direction towards more efficient neural scaling laws based on data importance sampling. Herrmann et al [35]: Rematerialization; ZeRO-Offload [74]: Offloading; Beaumont et al [7]: Offloading + Rematerialization; ZeRO [72]: DP+MP+AMP; Megatron-LM [75]: DP+TP; GPipe [40]: DP+PP; torchgpipe [48]: PP+Rematerialization; Megatron-LM* [65]: DP+TP+PP+AMP; Wang et al [84]: FP8 Training; Cambier et al [11]: FP8 Training; Mesa [68]: 8-bit ACT; ACTNN [12], GACT [60]: 2-bit ACT; [52,42,37]: Addition-based PET; Bitfit [89], LoRA [38]: Reparameterization-based PET…”
Section: Data Selection
confidence: 99%
“…The work presented in [54] combines rematerialization to trade memory for computation time and offloading to trade memory for data movement. It employs a dynamic programming heuristic to determine the optimal offloading sequence.…”
Section: Further Analysis 1) Training Efficiency
confidence: 99%
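
The dynamic programming idea quoted above can be illustrated with a toy model. The sketch below chooses, for each stored activation, whether to keep it resident on the GPU, offload it to host memory, or discard and recompute it, minimizing the added time under a fixed memory budget. It deliberately ignores the overlap of transfers with computation and the execution ordering that the heuristic of [54] takes into account; the function name and every number are made up for the example.

def offload_plan(sizes, offload_cost, recompute_cost, budget):
    # sizes[i]: MB occupied on the GPU if activation i stays resident
    # offload_cost[i]: ms to move activation i to host memory and back
    # recompute_cost[i]: ms to rematerialize activation i during backward
    # budget: MB of GPU memory available for resident activations
    INF = float("inf")
    dp = [0.0] + [INF] * budget        # dp[m]: best added time with m MB kept
    parent = []                        # parent[i][m] = (previous m, decision)
    for i in range(len(sizes)):
        nxt, par = [INF] * (budget + 1), [None] * (budget + 1)
        for m in range(budget + 1):
            if dp[m] == INF:
                continue
            options = [("offload", m, dp[m] + offload_cost[i]),
                       ("recompute", m, dp[m] + recompute_cost[i])]
            if m + sizes[i] <= budget:
                options.append(("keep", m + sizes[i], dp[m]))
            for decision, new_m, t in options:
                if t < nxt[new_m]:
                    nxt[new_m], par[new_m] = t, (m, decision)
        dp = nxt
        parent.append(par)
    best_m = min(range(budget + 1), key=lambda m: dp[m])
    plan, m = [], best_m
    for i in reversed(range(len(sizes))):
        m, decision = parent[i][m]
        plan.append(decision)
    return dp[best_m], plan[::-1]

# Four activations competing for 1000 MB of free GPU memory (made-up numbers).
print(offload_plan(sizes=[400, 600, 300, 500],
                   offload_cost=[12.0, 18.0, 9.0, 15.0],
                   recompute_cost=[5.0, 30.0, 4.0, 25.0],
                   budget=1000))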
“…• Offloading [6]: Offloading network activations from accelerator to system memory. Whenever the back-propagation process requires a set of activations, they are transferred back from system to accelerator memory.…”
Section: Memory Workarounds
confidence: 99%
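
Modern frameworks expose hooks for exactly the workaround described above. The sketch below is a minimal example, assuming PyTorch 1.10 or later (which provides torch.autograd.graph.save_on_cpu) and a CUDA device: every tensor saved for the backward pass is copied to pinned host memory during the forward pass and copied back to the GPU only when its gradient computation needs it. It demonstrates the generic offloading mechanism only, not the offloading schedule computed by the paper; the model and batch shapes are arbitrary.

import torch
import torch.nn as nn

# Arbitrary model and batch, just large enough for activation storage to matter.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(),
                      nn.Linear(4096, 4096), nn.ReLU(),
                      nn.Linear(4096, 10)).cuda()
x = torch.randn(512, 4096, device="cuda")

# Saved activations live in pinned CPU memory between the forward and
# backward passes instead of occupying GPU memory.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()
loss.backward()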
“…At the same time, current accelerators (e.g., GPUs, TPUs) are rather limited in terms of memory capacity, although workarounds to load larger memories than the one offered by the device have already been proposed (as discussed in Section 1). These workarounds include model parallelism [3,10], activations re-computation [7] and offloading [6], enabling greater memory loads at the cost of computation efficiency. In this high-memory load context, avoiding accelerators and using CPU computation must be considered as a feasible alternative.…”
Section: Introduction
confidence: 99%