2022
DOI: 10.48550/arxiv.2203.07259
Preprint
The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

Abstract: Pre-trained Transformer-based language models have become a key building block for natural language processing (NLP) tasks. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, structured and unstructured pruning are known to be applicable to decrease model size and increase inference speed. In this context, this paper's contributions are two-fold. We begin with an i…

Cited by 8 publications (23 citation statements)
References 15 publications
“…Singh and Alistarh [2020] investigated a diagonal block-wise approximation with a predefined block size B, which reduces storage cost from O(d²) to O(Bd), and showed that this approach can lead to strong results when pruning CNNs. Kurtic et al [2022] proposed a formula for block pruning, together with a set of non-trivial optimizations to efficiently compute the block inverse, which allowed them to scale the approach for the first time to large language models.…”
Section: Background and Problem Setup
confidence: 99%
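As the quotation notes, the block-diagonal approximation cuts inverse-Hessian storage from O(d²) to O(Bd). A minimal sketch of that accounting (toy sizes, not the authors' implementation):

```python
# Toy illustration (not the authors' code) of the storage savings from a
# diagonal block-wise inverse-Hessian approximation with block size B.
d = 1_000  # total number of weights (toy size)
B = 50     # block size; assumes B divides d for simplicity

# Dense inverse Hessian: one d x d matrix -> O(d^2) entries.
full_entries = d * d

# Block-diagonal approximation: d // B blocks, each B x B -> B * d entries.
block_entries = (d // B) * B * B

print(f"dense:  {full_entries:,} entries")   # dense:  1,000,000 entries
print(f"blocks: {block_entries:,} entries")  # blocks: 50,000 entries
```

With B ≪ d this is the difference between quadratic and linear memory in the number of weights, which is what makes the approach feasible for large language models.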
“…where E_Q ∈ ℝ^{|Q|×d} is a matrix of basis vectors for each weight in Q. The corresponding saliency score for the group of weights Q and the update δw*_Q of remaining weights is [Kurtic et al, 2022]:…”
Section: Background and Problem Setup
confidence: 99%
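The formula itself is clipped from the quotation above. For orientation only, the standard Optimal Brain Surgeon-style group saliency and weight update from the second-order pruning literature take the following form (notation assumed: H is the Hessian approximation, w_Q the weights in group Q, E_Q as defined in the quotation; this is not reproduced from the clipped text):

```latex
% OBS-style group saliency and update (sketch, standard literature form)
\rho_Q = \tfrac{1}{2}\, w_Q^{\top}
         \left( E_Q H^{-1} E_Q^{\top} \right)^{-1} w_Q,
\qquad
\delta w^{*} = -\, H^{-1} E_Q^{\top}
               \left( E_Q H^{-1} E_Q^{\top} \right)^{-1} w_Q .
```

Here E_Q H^{-1} E_Q^⊤ is simply the |Q|×|Q| submatrix of H^{-1} indexed by Q, so the saliency measures the loss increase from zeroing the group while the update optimally compensates the remaining weights.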