2021
DOI: 10.48550/arxiv.2111.11124
Preprint

Mesa: A Memory-saving Training Framework for Transformers

Abstract: There has been an explosion of interest in designing high-performance Transformers. While Transformers have delivered significant performance improvements, training such networks is extremely memory intensive owing to storing all intermediate activations that are needed for gradient computation during backpropagation, especially for long sequences. To this end, we present Mesa, a memory-saving, resource-efficient training framework for Transformers. Specifically, Mesa uses exact activations during forward pass w…
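The core idea sketched in the abstract, computing the forward pass with exact activations while keeping only a compressed copy for the backward pass, can be illustrated with a custom PyTorch autograd function. The snippet below is a minimal sketch of this activation-compressed-training pattern only, not Mesa's actual implementation; the 8-bit min-max quantizer and the MemorySavingLinear wrapper are invented for the example.

```python
# Minimal sketch of activation-compressed training (ACT): the forward pass uses
# exact activations, but only an 8-bit copy of the input is kept for backward,
# where it is dequantized before the gradients are formed. Illustration of the
# general idea only; the quantizer and wrapper names are invented here and are
# not Mesa's actual implementation.
import torch


def quantize_8bit(x):
    """Per-tensor asymmetric 8-bit quantization (illustrative helper)."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / 255.0
    q = ((x - lo) / scale).round().to(torch.uint8)
    return q, scale, lo


def dequantize_8bit(q, scale, lo):
    return q.to(torch.float32) * scale + lo


class MemorySavingLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        out = x @ weight.t()                  # exact activations in forward
        q, scale, lo = quantize_8bit(x)       # compressed copy for backward
        ctx.save_for_backward(q, scale, lo, weight)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        q, scale, lo, weight = ctx.saved_tensors
        x_hat = dequantize_8bit(q, scale, lo)  # approximate activations
        return grad_out @ weight, grad_out.t() @ x_hat


x = torch.randn(4, 16, requires_grad=True)
w = torch.randn(8, 16, requires_grad=True)
MemorySavingLinear.apply(x, w).sum().backward()
print(x.grad.shape, w.grad.shape)
```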

Cited by 3 publications (5 citation statements)
References 30 publications
“…It demonstrates a promising direction towards more efficient neural scaling laws based on data importance sampling. … Rematerialization; Herrmann et al. [35]: Rematerialization; ZeRO-Offload [74]: Offloading; Beaumont et al. [7]: Offloading + Rematerialization; ZeRO [72]: DP+MP+AMP; Megatron-LM [75]: DP+TP; GPipe [40]: DP+PP; torchgpipe [48]: PP+Rematerialization; Megatron-LM* [65]: DP+TP+PP+AMP; Wang et al. [84]: FP8 Training; Cambier et al. [11]: FP8 Training; Mesa [68]: 8-bit ACT; ACTNN [12], GACT [60]: 2-bit ACT; [52,42,37]: Addition-based PET; Bitfit [89], LoRA [38]: Reparameterization-based PET…”
Section: Data Selection (mentioning; confidence: 99%)
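One family of techniques listed in the excerpt above, rematerialization (gradient checkpointing), is exposed directly in PyTorch. The toy sketch below shows the pattern under stated assumptions (an arbitrary stand-in block, not a model from the cited works): inner activations of the wrapped block are discarded after the forward pass and recomputed during backward.

```python
# Rematerialization (gradient checkpointing) as it is exposed in PyTorch:
# activations inside the wrapped block are dropped after the forward pass and
# recomputed during backward, trading compute for memory. Toy block only.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(256, 1024), torch.nn.GELU(),
                            torch.nn.Linear(1024, 256))
x = torch.randn(32, 256, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # only the input is stored
y.sum().backward()
print(x.grad.shape)
```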
“…The saved activations are then dequantized to the original precision in the backward pass to calculate gradients. Recent works [68,60] have proposed applying ACT in general frameworks that support memory-efficient Transformer training.…”
Section: Memory Efficiency (mentioning; confidence: 99%)
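A hedged sketch of how ACT can be attached to an unmodified model, in the spirit of the general frameworks mentioned above: PyTorch's saved-tensor hooks let every tensor stored for backward be packed into 8 bits and dequantized when the backward pass reads it. The per-tensor min-max quantizer here is an assumption for illustration and far cruder than what Mesa [68] or GACT [60] actually use.

```python
# Sketch of attaching ACT to an unmodified model via PyTorch's saved-tensor
# hooks: every tensor saved for the backward pass is packed into 8 bits and
# dequantized to full precision when backward reads it. Crude per-tensor
# min-max quantizer for illustration only; note it also compresses saved
# weights, which a real system such as Mesa or GACT would treat separately.
import torch


def pack(t):
    if not t.is_floating_point():            # leave integer tensors untouched
        return t
    lo, hi = t.min(), t.max()
    scale = (hi - lo).clamp(min=1e-8) / 255.0
    return ((t - lo) / scale).round().to(torch.uint8), scale, lo, t.dtype


def unpack(packed):
    if isinstance(packed, torch.Tensor):
        return packed
    q, scale, lo, dtype = packed
    return (q.to(torch.float32) * scale + lo).to(dtype)


model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 64))
x = torch.randn(8, 64)

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    loss = model(x).sum()
loss.backward()                              # dequantization happens here
```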
“…"swap" is a simple swapping strategy that swaps all activations to the CPU. For Bert-large, we also show the results on Mesa [7], a memory-saving resource-efficient training framework for transformers, and ZeRO-Offload [37], a highly optimized system for training large-scale language models. Gradient checkpointing uses the default checkpointing policy provided by the transformer library [38], where only the input to each transformer block is saved before the backward pass.…”
Section: Memory Saving and Computational Overheadmentioning
confidence: 99%
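The "swap" baseline described above, moving all saved activations to host memory, has a direct counterpart in PyTorch's save_on_cpu hook. A minimal sketch follows; the toy model is an assumption, and unlike ZeRO-Offload it makes no attempt to overlap transfers with computation.

```python
# Sketch of the "swap" strategy: activations saved for backward are moved to
# (pinned) host memory after the forward pass and copied back on demand during
# backward. Toy model only; real offloading systems such as ZeRO-Offload
# overlap these transfers with computation.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(torch.nn.Linear(512, 2048), torch.nn.GELU(),
                            torch.nn.Linear(2048, 512)).to(device)
x = torch.randn(16, 512, device=device)

with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()
loss.backward()
```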
“…Although ACT has already demonstrated impressive compression capabilities, previous work on ACT is restricted to specific NN architectures. For example, ActNN [5] is a quantization framework for convolutional NNs only; Mesa [7] proposes a per head/layer quantization method for vision transformers; and AC-GC [6] derives convergence error bound for different types of operators separately.…”
Section: Introduction (mentioning; confidence: 99%)
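The per-head granularity attributed to Mesa above can be illustrated by giving each attention head its own quantization range instead of a single range for the whole activation tensor. The helper below is a hypothetical sketch with made-up shapes and function names, not Mesa's code.

```python
# Sketch of per-head quantization granularity: each attention head gets its own
# min/max range instead of one range for the whole tensor. Shapes and helper
# names are hypothetical; this is not Mesa's code.
import torch


def quantize_per_head(x):
    # x: (batch, heads, tokens, dim) attention activations
    lo = x.amin(dim=(0, 2, 3), keepdim=True)     # one range per head
    hi = x.amax(dim=(0, 2, 3), keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / 255.0
    return ((x - lo) / scale).round().to(torch.uint8), scale, lo


def dequantize_per_head(q, scale, lo):
    return q.to(torch.float32) * scale + lo


x = torch.randn(2, 8, 197, 64)                   # ViT-style activation tensor
q, scale, lo = quantize_per_head(x)
err = (dequantize_per_head(q, scale, lo) - x).abs().max()
print(f"max reconstruction error: {err:.4f}")
```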
“…As the amount of computational power required for training and inference increases, one of the more advanced neural networks is the Transformer [12], which has a deeper topology. However, the depth of the transformer architecture gives rise to several constraints and challenges, including high computational complexity [13], substantial demands on computational resources [14], and high memory consumption [15] that is quadratic in the input sequence length. Therefore, methods are required to achieve excellent performance, mainly when using them as translation machines.…”
Section: Introduction (mentioning; confidence: 99%)
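As a rough, back-of-the-envelope illustration of the quadratic term mentioned above: the attention score matrices alone hold layers × heads × L × L values, so their footprint grows with the square of the sequence length L. The configuration below (12 layers, 12 heads, fp16 activations) is an assumption for illustration, not a measurement from any of the cited papers.

```python
# Back-of-the-envelope look at the quadratic term: the attention score matrices
# alone hold layers * heads * L * L values, so their footprint grows with the
# square of the sequence length L. The configuration below (12 layers, 12
# heads, fp16 activations) is illustrative, not a measurement from the paper.
layers, heads, bytes_per_value = 12, 12, 2

for L in (512, 1024, 2048, 4096):
    attn_bytes = layers * heads * L * L * bytes_per_value
    print(f"L={L:5d}: attention scores ~ {attn_bytes / 2**20:8.1f} MiB")
```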