Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021
DOI: 10.18653/v1/2021.emnlp-main.829

Block Pruning For Faster Transformers

Abstract: Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods are proven for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning. […]
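The following is a minimal sketch of the idea described in the abstract, not the authors' released code: each fixed-size block of a weight matrix receives a single score, and the lowest-scoring blocks are masked out during fine-tuning. The function names, toy shapes, and the 32x32 block size are illustrative assumptions.

```python
# Minimal sketch of block-level pruning (illustrative, not the paper's implementation).
import torch

def block_mask(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the top `keep_ratio` fraction of blocks, zero out the rest."""
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores.flatten(), k).values.min()
    return (scores >= threshold).float()

def prune_blocks(weight: torch.Tensor, scores: torch.Tensor,
                 block_rows: int, block_cols: int, keep_ratio: float) -> torch.Tensor:
    """Apply a block-level mask to `weight`.

    `scores` holds one entry per (block_rows x block_cols) block; in movement
    pruning it would be learned jointly with the weights during fine-tuning,
    here it is simply given.
    """
    mask = block_mask(scores, keep_ratio)             # (n_block_rows, n_block_cols)
    mask = mask.repeat_interleave(block_rows, dim=0)  # expand to the weight's shape
    mask = mask.repeat_interleave(block_cols, dim=1)
    return weight * mask

# Toy usage: a 64x64 matrix pruned in 32x32 blocks, keeping half of the blocks.
w = torch.randn(64, 64)
s = torch.randn(64 // 32, 64 // 32)  # one score per block
w_pruned = prune_blocks(w, s, 32, 32, keep_ratio=0.5)
```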

Cited by 61 publications (55 citation statements)
References 14 publications
“…Storing such sparse matrices does not lead to immediate gains, and sparse matrix multiplication is not always faster, especially on GPUs (Gale et al., 2020). As such, other work considers structured pruning of entire rows or columns of the matrices, which makes it much easier to realize efficiency gains (Fan et al., 2021; Lagunas et al., 2021). We explore an alternative structured pruning approach, rank pruning.…”
Section: Pruning Methods
confidence: 99%
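The contrast drawn in this excerpt can be illustrated with a small sketch (the shapes and sparsity levels are arbitrary assumptions, not taken from the cited works): zeroing individual entries leaves the matmul shape unchanged, while dropping whole rows shrinks the dense operation itself.

```python
# Illustrative sketch: unstructured zeros vs. removing entire rows.
import torch

W = torch.randn(1024, 1024)   # dense weight (out_features x in_features)
x = torch.randn(8, 1024)      # a batch of activations

# Unstructured: zero ~90% of individual entries; the matmul shape is unchanged,
# so there is no speedup without a specialized sparse kernel.
mask = (torch.rand_like(W) > 0.9).float()
y_unstructured = x @ (W * mask).T          # (8, 1024)

# Structured: keep only 10% of the output rows; the matrix itself shrinks,
# so the ordinary dense kernel is already doing less work.
keep_rows = torch.randperm(1024)[:102]
W_small = W[keep_rows]                     # (102, 1024)
y_structured = x @ W_small.T               # (8, 102)
```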
“…However, sparsifying a matrix can lead to specialized hardware and algorithmic optimizations, as demonstrated by sparse multiplication libraries (Gale et al., 2020). Lagunas et al. (2021) optimize element-wise unstructured pruning in a simple manner by removing entirely pruned rows, columns, or attention heads. They show that even at high sparsities (more than 90%), this strategy achieves at most around a 1.5× speedup.…”
Section: Runtime Comparison
confidence: 99%
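A rough sketch of the compaction strategy described in this excerpt, under the assumption that some rows and columns end up entirely zero after unstructured pruning; the function name and the index bookkeeping are simplified for illustration.

```python
# Illustrative sketch: physically remove all-zero rows/columns after pruning.
import torch

def compact(weight: torch.Tensor):
    """Drop rows and columns of `weight` that are entirely zero."""
    row_keep = weight.abs().sum(dim=1) > 0
    col_keep = weight.abs().sum(dim=0) > 0
    return weight[row_keep][:, col_keep], row_keep, col_keep

w = torch.randn(8, 8)
w[2] = 0.0      # a fully pruned row
w[:, 5] = 0.0   # a fully pruned column
w_small, rows_kept, cols_kept = compact(w)   # w_small has shape (7, 7)
```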
“…Model compression and knowledge distillation present additional opportunities to improve inference performance. While there are many ways to perform model compression, such as quantization [38,39,40] and pruning [41,42], our current efforts focus on layer reduction through knowledge distillation [43] (KD), reducing both model size and computation while preserving the MoE structure in the student model. KD has proven to be a successful way to compress a large model into a small one that contains far fewer parameters and computations while still obtaining competitive results.…”
Section: Mixture-of-Students: Distillation for Even Smaller Model Siz…
confidence: 99%
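As a generic illustration of the knowledge-distillation objective this excerpt refers to (not the cited system's actual code; the temperature and weighting below are arbitrary), a layer-reduced student is trained to match the teacher's softened output distribution:

```python
# Generic knowledge-distillation loss sketch (illustrative hyperparameters).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term (teacher -> student) with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 3-class problem.
student = torch.randn(4, 3)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
loss = distillation_loss(student, teacher, labels)
```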
“…As might be expected, the impact is dictated by the severity of the constraints. If the partitions are too small, or the blocks too large, accuracy becomes degraded to an unacceptable extent [40].…”
Section: Structured Sparsity
confidence: 99%
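A small sketch of what a partitioned block-sparsity constraint looks like in practice (the partition count, per-partition budget, and function name are assumptions for illustration, not taken from the cited work): each partition may keep only a fixed number of blocks, which is the kind of constraint whose severity the excerpt discusses.

```python
# Illustrative sketch: keep a fixed number of blocks per partition.
import torch

def partitioned_block_mask(scores: torch.Tensor, keep_per_partition: int) -> torch.Tensor:
    """`scores` has shape (n_partitions, blocks_per_partition); keep the top-k blocks in each partition."""
    topk = torch.topk(scores, keep_per_partition, dim=1).indices
    mask = torch.zeros_like(scores)
    mask.scatter_(1, topk, 1.0)
    return mask

scores = torch.randn(4, 8)                   # 4 partitions, 8 blocks each
mask = partitioned_block_mask(scores, 2)     # keep 2 of 8 blocks per partition
# Smaller partitions or larger blocks tighten this budget, which is the
# accuracy/sparsity trade-off discussed in the excerpt.
```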
“…Block and partitioned sparsity help align the patterns of non-zero elements with hardware requirements, but are fundamentally at odds with creating highly sparse and accurate networks. Optimal performance requires large blocks and reduced partition sizes, but this limits both the obtainable sparsity and the accuracy [40]. This in turn prevents these approaches from achieving the theoretical performance benefits of highly sparse networks.…”
Section: Complementary Sparsity
confidence: 99%