Reducing Activation Recomputation in Large Transformer Models
Preprint, 2022
DOI: 10.48550/arxiv.2205.05198

Abstract: Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary […]
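The abstract describes the generic form of activation recomputation (gradient checkpointing): instead of storing every intermediate activation for backpropagation, checkpointed layers are re-executed during the backward pass. Below is a minimal sketch of that generic technique using PyTorch's torch.utils.checkpoint; the toy block, dimensions, and tensor shapes are illustrative assumptions, and this is not the paper's Megatron-LM code.

```python
# Minimal sketch of generic activation recomputation (gradient checkpointing)
# in PyTorch. Illustrates the trade-off the abstract describes; `ToyBlock` and
# all sizes are made up, this is not the paper's Megatron-LM implementation.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class ToyBlock(nn.Module):
    """A toy transformer-style feed-forward block."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(self.norm(x))


dim = 256
blocks = nn.ModuleList([ToyBlock(dim) for _ in range(4)])
x = torch.randn(8, 128, dim, requires_grad=True)  # (batch, sequence, hidden)

# Forward: intermediate activations inside each block are *not* stored; during
# backward they are recomputed by running the block's forward again, which
# lowers peak memory at the cost of extra compute.
h = x
for blk in blocks:
    h = checkpoint(blk, h, use_reentrant=False)

h.sum().backward()  # backward triggers one extra forward per checkpointed block
```

The paper's point is that much of this recomputation can be avoided; the generic pattern above is the baseline behavior it improves on.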


Cited by 4 publications (9 citation statements)
References 3 publications
“…Baselines. We took Megatron v3.0 (Megatron-SP) [28], Megatron v2.5 (Megatron-PTD) [8], DeepSpeed 3D (DSpeed3D) [19], and DeepSpeed ZeRO3 (DSpeedZ3) [29] as our baselines. Megatron-SP is the latest 3D parallel training system that was reported to achieve almost linear scaling efficiency.…”
Section: Discussion
confidence: 99%
“…We ran these two DeepSpeed systems in DeepSpeed v0.5.5 environment. Sequence parallelism [28] was integrated into Megatron, DSpeed3D and FOLD3D to reduce the activation size and support larger models.…”
Section: Discussion
confidence: 99%
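The sequence parallelism this quotation refers to partitions activations along the sequence dimension, so each tensor-parallel rank stores only a fraction of them. A single-process sketch of just that sharding arithmetic follows; the function name, shapes, and rank/world-size values are illustrative assumptions, not Megatron's API.

```python
# Single-process sketch of sequence-dimension sharding, the idea behind the
# sequence parallelism this quotation refers to. Names and sizes are illustrative.
import torch


def sequence_shard(x: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Return the slice of a (batch, seq, hidden) activation owned by `rank`."""
    seq_len = x.shape[1]
    assert seq_len % world_size == 0, "sequence length must divide evenly"
    chunk = seq_len // world_size
    return x[:, rank * chunk:(rank + 1) * chunk, :]


x = torch.randn(2, 1024, 4096)                   # full activation, ~32 MiB in fp32
shard = sequence_shard(x, rank=0, world_size=8)  # this rank keeps 1/8 of it
print(shard.shape)                               # torch.Size([2, 128, 4096])
```

In the real system the shards live on different GPUs and are exchanged with all-gather/reduce-scatter collectives around the tensor-parallel regions; the sketch only shows why the per-GPU activation footprint shrinks with the parallel degree.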
“…Inspired by Einops [41], we apply annotations on tensor dimensions to indicate transformation strategies. [The quotation continues with a table of parallelism categories, mechanisms, and support: SPMD parallelism covering Data Parallelism [1], Sequence Parallelism [24], Transformer Parallelism [45], DAP [11], ZeRO [38], Sequence Parallelism [26]*, and Flexible Tensor Parallel [20,53,56]; MPMD parallelism covering 1F1B [45,50], GPipe [19], Chimera [27], PipeDream (Async) [33], and Terapipe [28].]…”
Section: Methods
confidence: 99%
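The "annotations on tensor dimensions" in this quotation are in the spirit of einops-style named-axis patterns. A tiny sketch using the einops library (the quotation's reference [41]) follows; the particular pattern and shapes are arbitrary examples, not the cited system's notation.

```python
# Sketch of einops-style dimension annotation; the pattern and shapes are
# arbitrary examples, not the cited system's actual notation.
import torch
from einops import rearrange

x = torch.randn(2, 128, 512)                   # (batch, seq, hidden)
# Name each axis and describe the transformation declaratively:
y = rearrange(x, "b s (h d) -> b h s d", h=8)  # split hidden into 8 heads
print(y.shape)                                 # torch.Size([2, 8, 128, 64])
```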
“…In addition, various memory optimizations [10,18,23] have been adopted to exploit large-scale model training under GPU memory constraints. Recently, systems, e.g., Megatron-LM [24,34,45], DeepSpeed [38], Piper [46], Unity [51] and Alpa [61], combine multiple parallelisms and memory optimizations within one system to accelerate distributed DNN training. However, these solutions fall short in relying on empirical parallelism configurations and having limited execution scheduling choices.…”
Section: Related Work
confidence: 99%