2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/ipdps54959.2023.00031
An Efficient 2D Method for Training Super-Large Deep Learning Models

Cited by 10 publications (2 citation statements)
References 8 publications
“…Model parallelism splits the model across multiple GPUs, each handling different stages. The model parallelism includes two categories: pipeline parallelism [42,72,83], placing individual layers on single GPUs, and tensor parallelism [28,30,45], dividing each tensor into chunks for specific GPUs.…”
Section: Distributed Training (mentioning)
confidence: 99%
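To make the distinction in the statement above concrete, the following is a minimal NumPy sketch of the general tensor-parallelism idea (splitting a tensor into per-device chunks), not the paper's specific 2D partitioning scheme; all sizes and names are illustrative assumptions.

import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
batch, d_in, d_out, n_devices = 4, 8, 8, 2

rng = np.random.default_rng(0)
x = rng.standard_normal((batch, d_in))
w = rng.standard_normal((d_in, d_out))

# Tensor parallelism: split the weight column-wise, one shard per "device".
w_shards = np.split(w, n_devices, axis=1)

# Each "device" computes a partial output with its own shard.
partial_outputs = [x @ shard for shard in w_shards]

# Concatenating the partial outputs reproduces the full layer output.
y_parallel = np.concatenate(partial_outputs, axis=1)
assert np.allclose(y_parallel, x @ w)

Pipeline parallelism, by contrast, would place whole layers on different devices and pass activations between them stage by stage.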
“…On the other side, the rapid growth in the memory requirements of large-scale DNN models [2,75] has sparked the development of methods at the system- and algorithm-level to alleviate memory demands. Examples for these methods include recomputation [40,86], offloading [69], distributed training [28,30,42,45,72,83] and low-rank adaptation [29]. Even though these optimizations can effectively reduce memory footprint for training or fine-tuning large-scale DNN models, they may lead to poor memory utilization.…”
Section: Introduction (mentioning)
confidence: 99%
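As a loose illustration of the recomputation technique mentioned in this statement, here is a minimal NumPy sketch: instead of storing every intermediate activation for the backward pass, only periodic checkpoints are kept and the missing activations are re-run from the nearest checkpoint when needed. The helper names (forward_with_checkpoints, recompute_segment) and the toy layer are invented for this sketch and are not from the cited works.

import numpy as np

def layer(x, w):
    # Illustrative layer; any differentiable op would do.
    return np.tanh(x @ w)

def forward_store_all(x, weights):
    # Baseline: keep every intermediate activation for the backward pass.
    acts = [x]
    for w in weights:
        acts.append(layer(acts[-1], w))
    return acts  # memory grows linearly with depth

def forward_with_checkpoints(x, weights, every=2):
    # Recomputation: keep only every `every`-th activation ("checkpoints").
    ckpts = {0: x}
    for i, w in enumerate(weights):
        x = layer(x, w)
        if (i + 1) % every == 0:
            ckpts[i + 1] = x
    return ckpts

def recompute_segment(ckpts, weights, target_idx, every=2):
    # During backward, re-run the forward from the nearest stored
    # checkpoint to rebuild the activation at `target_idx`.
    start = (target_idx // every) * every
    x = ckpts[start]
    for i in range(start, target_idx):
        x = layer(x, weights[i])
    return x

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) for _ in range(6)]
x0 = rng.standard_normal((4, 8))

full = forward_store_all(x0, weights)
ckpts = forward_with_checkpoints(x0, weights, every=2)
# The recomputed activation matches the one the baseline had stored.
assert np.allclose(recompute_segment(ckpts, weights, 3, every=2), full[3])

This trades extra forward compute for a smaller activation footprint, which is the same memory-versus-compute trade-off the citing authors group together with offloading, distributed training, and low-rank adaptation.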