2021
DOI: 10.48550/arxiv.2111.11124
Preprint

Mesa: A Memory-saving Training Framework for Transformers

Abstract: There has been an explosion of interest in designing high-performance Transformers. While Transformers have delivered significant performance improvements, training such networks is extremely memory intensive owing to storing all intermediate activations that are needed for gradient computation during backpropagation, especially for long sequences. To this end, we present Mesa, a memory-saving, resource-efficient training framework for Transformers. Specifically, Mesa uses exact activations during forward pass w…
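The core idea sketched in the abstract, computing the forward pass with exact activations while keeping only a compressed copy for the backward pass, can be illustrated with a custom PyTorch autograd function. The snippet below is a minimal sketch of this activation-compressed-training pattern only, not Mesa's actual implementation; the 8-bit min-max quantizer and the MemorySavingLinear wrapper are invented for the example.

```python
# Minimal sketch of activation-compressed training (ACT): the forward pass uses
# exact activations, but only an 8-bit copy of the input is kept for backward,
# where it is dequantized before the gradients are formed. Illustration of the
# general idea only; the quantizer and wrapper names are invented here and are
# not Mesa's actual implementation.
import torch


def quantize_8bit(x):
    """Per-tensor asymmetric 8-bit quantization (illustrative helper)."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / 255.0
    q = ((x - lo) / scale).round().to(torch.uint8)
    return q, scale, lo


def dequantize_8bit(q, scale, lo):
    return q.to(torch.float32) * scale + lo


class MemorySavingLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        out = x @ weight.t()                  # exact activations in forward
        q, scale, lo = quantize_8bit(x)       # compressed copy for backward
        ctx.save_for_backward(q, scale, lo, weight)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        q, scale, lo, weight = ctx.saved_tensors
        x_hat = dequantize_8bit(q, scale, lo)  # approximate activations
        return grad_out @ weight, grad_out.t() @ x_hat


x = torch.randn(4, 16, requires_grad=True)
w = torch.randn(8, 16, requires_grad=True)
MemorySavingLinear.apply(x, w).sum().backward()
print(x.grad.shape, w.grad.shape)
```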

Cited by 3 publications (5 citation statements)
References 30 publications
“…It demonstrates a promising direction towards more efficient neural scaling laws based on data importance sampling. … Rematerialization; Herrmann et al. [35]: Rematerialization; ZeRO-Offload [74]: Offloading; Beaumont et al. [7]: Offloading + Rematerialization; ZeRO [72]: DP+MP+AMP; Megatron-LM [75]: DP+TP; GPipe [40]: DP+PP; torchgpipe [48]: PP+Rematerialization; Megatron-LM* [65]: DP+TP+PP+AMP; Wang et al. [84]: FP8 Training; Cambier et al. [11]: FP8 Training; Mesa [68]: 8-bit ACT; ACTNN [12], GACT [60]: 2-bit ACT; [52,42,37]: Addition-based PET; Bitfit [89], LoRA [38]: Reparameterization-based PET…”
Section: Data Selection (mentioning; confidence: 99%)
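One family of techniques listed in the excerpt above, rematerialization (gradient checkpointing), is exposed directly in PyTorch. The toy sketch below shows the pattern under stated assumptions (an arbitrary stand-in block, not a model from the cited works): inner activations of the wrapped block are discarded after the forward pass and recomputed during backward.

```python
# Rematerialization (gradient checkpointing) as it is exposed in PyTorch:
# activations inside the wrapped block are dropped after the forward pass and
# recomputed during backward, trading compute for memory. Toy block only.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(256, 1024), torch.nn.GELU(),
                            torch.nn.Linear(1024, 256))
x = torch.randn(32, 256, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # only the input is stored
y.sum().backward()
print(x.grad.shape)
```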
“…The saved activations are then dequantized to the original precision in the backward pass to calculate gradients. Recent works [68,60] have proposed applying ACT in general frameworks that support memory-efficient Transformer training.…”
Section: Memory Efficiency (mentioning; confidence: 99%)
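A hedged sketch of how ACT can be attached to an unmodified model, in the spirit of the general frameworks mentioned above: PyTorch's saved-tensor hooks let every tensor stored for backward be packed into 8 bits and dequantized when the backward pass reads it. The per-tensor min-max quantizer here is an assumption for illustration and far cruder than what Mesa [68] or GACT [60] actually use.

```python
# Sketch of attaching ACT to an unmodified model via PyTorch's saved-tensor
# hooks: every tensor saved for the backward pass is packed into 8 bits and
# dequantized to full precision when backward reads it. Crude per-tensor
# min-max quantizer for illustration only; note it also compresses saved
# weights, which a real system such as Mesa or GACT would treat separately.
import torch


def pack(t):
    if not t.is_floating_point():            # leave integer tensors untouched
        return t
    lo, hi = t.min(), t.max()
    scale = (hi - lo).clamp(min=1e-8) / 255.0
    return ((t - lo) / scale).round().to(torch.uint8), scale, lo, t.dtype


def unpack(packed):
    if isinstance(packed, torch.Tensor):
        return packed
    q, scale, lo, dtype = packed
    return (q.to(torch.float32) * scale + lo).to(dtype)


model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 64))
x = torch.randn(8, 64)

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    loss = model(x).sum()
loss.backward()                              # dequantization happens here
```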
“…"swap" is a simple swapping strategy that swaps all activations to the CPU. For Bert-large, we also show the results on Mesa [7], a memory-saving resource-efficient training framework for transformers, and ZeRO-Offload [37], a highly optimized system for training large-scale language models. Gradient checkpointing uses the default checkpointing policy provided by the transformer library [38], where only the input to each transformer block is saved before the backward pass.…”
Section: Memory Saving and Computational Overheadmentioning
confidence: 99%
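The "swap" baseline described above, moving all saved activations to host memory, has a direct counterpart in PyTorch's save_on_cpu hook. A minimal sketch follows; the toy model is an assumption, and unlike ZeRO-Offload it makes no attempt to overlap transfers with computation.

```python
# Sketch of the "swap" strategy: activations saved for backward are moved to
# (pinned) host memory after the forward pass and copied back on demand during
# backward. Toy model only; real offloading systems such as ZeRO-Offload
# overlap these transfers with computation.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(torch.nn.Linear(512, 2048), torch.nn.GELU(),
                            torch.nn.Linear(2048, 512)).to(device)
x = torch.randn(16, 512, device=device)

with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()
loss.backward()
```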
“…Although ACT has already demonstrated impressive compression capabilities, previous work on ACT is restricted to specific NN architectures. For example, ActNN [5] is a quantization framework for convolutional NNs only; Mesa [7] proposes a per head/layer quantization method for vision transformers; and AC-GC [6] derives convergence error bound for different types of operators separately.…”
Section: Introduction (mentioning; confidence: 99%)
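The per-head granularity attributed to Mesa above can be illustrated by giving each attention head its own quantization range instead of a single range for the whole activation tensor. The helper below is a hypothetical sketch with made-up shapes and function names, not Mesa's code.

```python
# Sketch of per-head quantization granularity: each attention head gets its own
# min/max range instead of one range for the whole tensor. Shapes and helper
# names are hypothetical; this is not Mesa's code.
import torch


def quantize_per_head(x):
    # x: (batch, heads, tokens, dim) attention activations
    lo = x.amin(dim=(0, 2, 3), keepdim=True)     # one range per head
    hi = x.amax(dim=(0, 2, 3), keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / 255.0
    return ((x - lo) / scale).round().to(torch.uint8), scale, lo


def dequantize_per_head(q, scale, lo):
    return q.to(torch.float32) * scale + lo


x = torch.randn(2, 8, 197, 64)                   # ViT-style activation tensor
q, scale, lo = quantize_per_head(x)
err = (dequantize_per_head(q, scale, lo) - x).abs().max()
print(f"max reconstruction error: {err:.4f}")
```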
“…As the amount of computational power required for training and inference increases, one of the more advanced neural networks is the Transformer [12], which has a deeper topology. However, the depth of the transformer architecture gives rise to several constraints and challenges, including high computational complexity [13], substantial demands on computational resources [14], and high memory consumption [15] that is quadratic in the input sequence length. Therefore, methods are required to achieve excellent performance, mainly when using them as translation machines.…”
Section: Introduction (mentioning; confidence: 99%)
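As a rough, back-of-the-envelope illustration of the quadratic term mentioned above: the attention score matrices alone hold layers × heads × L × L values, so their footprint grows with the square of the sequence length L. The configuration below (12 layers, 12 heads, fp16 activations) is an assumption for illustration, not a measurement from any of the cited papers.

```python
# Back-of-the-envelope look at the quadratic term: the attention score matrices
# alone hold layers * heads * L * L values, so their footprint grows with the
# square of the sequence length L. The configuration below (12 layers, 12
# heads, fp16 activations) is illustrative, not a measurement from the paper.
layers, heads, bytes_per_value = 12, 12, 2

for L in (512, 1024, 2048, 4096):
    attn_bytes = layers * heads * L * L * bytes_per_value
    print(f"L={L:5d}: attention scores ~ {attn_bytes / 2**20:8.1f} MiB")
```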