Reducing Activation Recomputation in Large Transformer Models
Preprint, 2022
DOI: 10.48550/arxiv.2205.05198

Abstract: Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary […]
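The abstract describes the generic form of activation recomputation (gradient checkpointing): instead of storing every intermediate activation for backpropagation, checkpointed layers are re-executed during the backward pass. Below is a minimal sketch of that generic technique using PyTorch's torch.utils.checkpoint; the toy block, dimensions, and tensor shapes are illustrative assumptions, and this is not the paper's Megatron-LM code.

```python
# Minimal sketch of generic activation recomputation (gradient checkpointing)
# in PyTorch. Illustrates the trade-off the abstract describes; `ToyBlock` and
# all sizes are made up, this is not the paper's Megatron-LM implementation.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class ToyBlock(nn.Module):
    """A toy transformer-style feed-forward block."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(self.norm(x))


dim = 256
blocks = nn.ModuleList([ToyBlock(dim) for _ in range(4)])
x = torch.randn(8, 128, dim, requires_grad=True)  # (batch, sequence, hidden)

# Forward: intermediate activations inside each block are *not* stored; during
# backward they are recomputed by running the block's forward again, which
# lowers peak memory at the cost of extra compute.
h = x
for blk in blocks:
    h = checkpoint(blk, h, use_reentrant=False)

h.sum().backward()  # backward triggers one extra forward per checkpointed block
```

The paper's point is that much of this recomputation can be avoided; the generic pattern above is the baseline behavior it improves on.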


Cited by 4 publications (9 citation statements)
References 3 publications
“…Baselines. We took Megatron v3.0 (Megatron-SP) [28], Megatron v2.5 (Megatron-PTD) [8], DeepSpeed 3D (DSpeed3D) [19], and DeepSpeed ZeRO3 (DSpeedZ3) [29] as our baselines. Megatron-SP is the latest 3D parallel training system that was reported to achieve almost linear scaling efficiency.…”
Section: Discussion
confidence: 99%
“…We ran these two DeepSpeed systems in DeepSpeed v0.5.5 environment. Sequence parallelism [28] was integrated into Megatron, DSpeed3D and FOLD3D to reduce the activation size and support larger models.…”
Section: Discussion
confidence: 99%
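The sequence parallelism this quotation refers to partitions activations along the sequence dimension, so each tensor-parallel rank stores only a fraction of them. A single-process sketch of just that sharding arithmetic follows; the function name, shapes, and rank/world-size values are illustrative assumptions, not Megatron's API.

```python
# Single-process sketch of sequence-dimension sharding, the idea behind the
# sequence parallelism this quotation refers to. Names and sizes are illustrative.
import torch


def sequence_shard(x: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Return the slice of a (batch, seq, hidden) activation owned by `rank`."""
    seq_len = x.shape[1]
    assert seq_len % world_size == 0, "sequence length must divide evenly"
    chunk = seq_len // world_size
    return x[:, rank * chunk:(rank + 1) * chunk, :]


x = torch.randn(2, 1024, 4096)                   # full activation, ~32 MiB in fp32
shard = sequence_shard(x, rank=0, world_size=8)  # this rank keeps 1/8 of it
print(shard.shape)                               # torch.Size([2, 128, 4096])
```

In the real system the shards live on different GPUs and are exchanged with all-gather/reduce-scatter collectives around the tensor-parallel regions; the sketch only shows why the per-GPU activation footprint shrinks with the parallel degree.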
“…Inspired by Einops [41], we apply annotations on tensor dimensions to indicate transformation strategies. [The quotation continues with a table of parallelism categories, mechanisms, and support: SPMD parallelism covering Data Parallelism [1], Sequence Parallelism [24], Transformer Parallelism [45], DAP [11], ZeRO [38], Sequence Parallelism [26]*, and Flexible Tensor Parallel [20,53,56]; MPMD parallelism covering 1F1B [45,50], GPipe [19], Chimera [27], PipeDream (Async) [33], and Terapipe [28].]…”
Section: Methods
confidence: 99%
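The "annotations on tensor dimensions" in this quotation are in the spirit of einops-style named-axis patterns. A tiny sketch using the einops library (the quotation's reference [41]) follows; the particular pattern and shapes are arbitrary examples, not the cited system's notation.

```python
# Sketch of einops-style dimension annotation; the pattern and shapes are
# arbitrary examples, not the cited system's actual notation.
import torch
from einops import rearrange

x = torch.randn(2, 128, 512)                   # (batch, seq, hidden)
# Name each axis and describe the transformation declaratively:
y = rearrange(x, "b s (h d) -> b h s d", h=8)  # split hidden into 8 heads
print(y.shape)                                 # torch.Size([2, 8, 128, 64])
```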
“…In addition, various memory optimizations [10,18,23] have been adopted to exploit large-scale model training under GPU memory constraints. Recently, systems, e.g., Megatron-LM [24,34,45], DeepSpeed [38], Piper [46], Unity [51] and Alpa [61], combine multiple parallelisms and memory optimizations within one system to accelerate distributed DNN training. However, these solutions fall short in relying on empirical parallelism configurations and having limited execution scheduling choices.…”
Section: Related Work
confidence: 99%