Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2021
DOI: 10.1145/3437801.3441593

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

Cited by 97 publications (20 citation statements)
References 16 publications
“…Pipeline parallelism splits a mini-batch into smaller micro-batches and pipelines them to the DNN model stages hosted on different workers so that workers can process different micro-batches simultaneously [12], [17], [24], [25]. Point-to-point communication is performed between workers hosting neighboring stages to transfer intermediate activations.…”
Section: Distributed DNN Training (mentioning)
confidence: 99%
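To make the micro-batching mechanism in the quote above concrete, here is a minimal single-process sketch (my own illustration, not DAPPLE's or any cited system's implementation): a mini-batch is chunked into micro-batches and each one flows through the chain of stages, where the stage-to-stage call stands in for the point-to-point activation transfer between workers.

```python
# Minimal sketch of micro-batched pipeline forward execution.
# All names (PipelineStage, forward_pipeline) are illustrative assumptions;
# a real deployment hosts each stage on its own worker and replaces the
# stage-to-stage call with a point-to-point send/recv of the activation.
import torch
import torch.nn as nn

class PipelineStage(nn.Module):
    """One contiguous slice of the model, hosted by one worker."""
    def __init__(self, layers: nn.Sequential):
        super().__init__()
        self.layers = layers

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        return self.layers(activation)

def forward_pipeline(stages, minibatch: torch.Tensor, num_microbatches: int):
    """Split a mini-batch into micro-batches and push each through all stages."""
    outputs = []
    for micro in torch.chunk(minibatch, num_microbatches, dim=0):
        act = micro
        for stage in stages:  # neighboring stages exchange the activation
            act = stage(act)
        outputs.append(act)
    return torch.cat(outputs, dim=0)
```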
“…next iteration. Despite better model accuracy, pipeline flush causes worker idling (i.e., bubbles) in pipeline execution [5], [12], [24], [25]. For GPipe and 1F1B, the ratio of the bubble time is (p − 1)/(m + p − 1), where p is the number of stages and m is the number of micro-batches [5].…”
Section: Distributed DNN Training (mentioning)
confidence: 99%
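As a quick sanity check of that bubble-time ratio, a tiny helper (names are my own) evaluates (p − 1)/(m + p − 1) and shows how raising the number of micro-batches relative to the number of stages shrinks the idle fraction.

```python
def bubble_ratio(p: int, m: int) -> float:
    """Bubble-time fraction (p - 1) / (m + p - 1) for GPipe/1F1B-style schedules,
    where p is the number of stages and m the number of micro-batches."""
    return (p - 1) / (m + p - 1)

print(bubble_ratio(p=4, m=16))  # 3/19 ≈ 0.158 -> ~15.8% of pipeline time is idle
print(bubble_ratio(p=4, m=64))  # 3/67 ≈ 0.045 -> more micro-batches, smaller bubble
```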
“…A recent line of work adds pipelines into model parallelism by partitioning model layers into parallel stages [7], [8], [45], [46], [47], [48], [49]. In this way, each training batch is divided into micro-batches to be processed by pipeline stages across computing devices.…”
Section: Further Analysis, 1) Training Efficiency (mentioning)
confidence: 99%
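The layer-partitioning step described in this last quote can be sketched as a naive contiguous split by layer count; this is an assumption for illustration only, since actual pipeline planners balance stages by profiled compute and memory cost rather than by counting layers.

```python
import torch.nn as nn

def partition_layers(layers, num_stages: int):
    """Naively split an ordered layer list into contiguous pipeline stages."""
    per_stage = (len(layers) + num_stages - 1) // num_stages  # ceiling division
    return [nn.Sequential(*layers[i:i + per_stage])
            for i in range(0, len(layers), per_stage)]

# Example: an 8-layer stack split across 4 devices, 2 layers per stage.
stages = partition_layers([nn.Linear(128, 128) for _ in range(8)], num_stages=4)
assert len(stages) == 4
```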