Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.496

Structured Pruning of Large Language Models

Abstract: Large language models have recently achieved state of the art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly, and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a generic, structured pruning approach by parameterizing each weight matrix using its low-rank factorization, and adaptively removing rank-1 components during training.
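The factorized parameterization described in the abstract can be illustrated with a short sketch. A minimal, hypothetical PyTorch version (module and variable names are mine, not the paper's code): each weight matrix W is reparameterized as P · diag(g) · Q, so removing a rank-1 component of W amounts to zeroing one entry of the diagonal mask g.

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Illustrative low-rank reparameterization W ≈ P @ diag(g) @ Q.

    Zeroing g[i] removes the i-th rank-1 component P[:, i] Q[i, :] while both
    factors stay dense; in the paper the mask is learned with an l0 penalty,
    whereas here g is a plain parameter (a sketch, not the authors' code).
    """

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.P = nn.Parameter(torch.randn(out_features, rank) * 0.02)
        self.Q = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.g = nn.Parameter(torch.ones(rank))  # diagonal mask over rank-1 components

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ((x Q^T) * g) P^T equals x (P diag(g) Q)^T, without materializing W.
        return ((x @ self.Q.t()) * self.g) @ self.P.t()
```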

Cited by 82 publications (68 citation statements) · References 48 publications (41 reference statements)
“…Another class of approaches carefully selects weights to reduce model size. Lan et al. (2020) use low-rank factorization to reduce the size of the embedding matrices, while Wang et al. (2019f) factorize other weight matrices. Additionally, parameters can be shared between layers (Dehghani et al., 2019; Lan et al., 2020) or between an encoder and decoder (Raffel et al., 2019).…”
Section: Inference
confidence: 99%
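The embedding factorization credited here to Lan et al. (2020, ALBERT) is easy to sketch. A minimal, hypothetical PyTorch illustration (names are mine): a V × d embedding table becomes a V × r lookup followed by an r × d projection, shrinking V·d parameters to V·r + r·d when r ≪ d.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Illustrative ALBERT-style embedding factorization (not library code).

    With vocab_size=30_000, hidden_dim=1024, r=128, roughly 30.7M embedding
    parameters shrink to about 4.0M (30_000*128 + 128*1024).
    """

    def __init__(self, vocab_size: int, hidden_dim: int, r: int):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, r)            # V x r table
        self.project = nn.Linear(r, hidden_dim, bias=False)  # r x d projection

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.project(self.lookup(token_ids))
```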
“…This objective is still complicated by the discrete nature of the z_τ's, but the expectation provides some guidance for empirically effective relaxations. We follow prior work (Louizos et al., 2018; Wang et al., 2019b) and relax z_τ into the continuous space [0, 1]^d with a stretched Hard-Concrete distribution (Jang et al., 2017; Maddison et al., 2017), which allows for the use of pathwise gradient estimators. Specifically, z_τ is now defined to be a deterministic and (sub)differentiable function of a sample u from a uniform distribution,…”
Section: Differentiable Approximation To The…
confidence: 99%
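The stretched Hard-Concrete relaxation quoted above can be written out in a few lines. This is a hedged sketch following Louizos et al. (2018); the parameter names (log_alpha, beta, gamma, zeta) are the conventional ones and are assumptions, not taken from the quoted paper's code.

```python
import torch

def sample_hard_concrete(log_alpha: torch.Tensor,
                         beta: float = 2.0 / 3.0,
                         gamma: float = -0.1,
                         zeta: float = 1.1) -> torch.Tensor:
    """Draw a relaxed gate z in [0, 1] as a differentiable function of u ~ U(0, 1).

    A binary-Concrete sample is stretched to (gamma, zeta) and clamped to
    [0, 1], so the endpoints carry exact 0/1 probability mass while gradients
    reach log_alpha through the pathwise (reparameterization) estimator.
    """
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)  # avoid log(0)
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / beta)
    return torch.clamp(s * (zeta - gamma) + gamma, min=0.0, max=1.0)
```

Under this distribution the probability that a gate is nonzero has the closed form sigmoid(log_alpha − beta · log(−gamma/zeta)), which is what makes an expected-ℓ0 penalty differentiable.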
“…For evaluation we use the GLUE benchmark (Wang et al., 2019b) as well as the SQuAD extractive question answering dataset (Rajpurkar et al., 2016). We use the pretrained BERT model of Devlin et al. (2019) to compare against the adapter-based approach of Houlsby et al. (2019). Our implementation is based on the Hugging Face Transformers library.…”
Section: Model and Datasets
confidence: 99%
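For readers unfamiliar with the toolchain this excerpt names, a minimal Hugging Face Transformers setup for a two-class GLUE-style task looks roughly like the following; the checkpoint and label count are illustrative, not the quoted paper's exact configuration.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative: a BERT checkpoint ready for fine-tuning on e.g. SST-2.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("a tightly paced, satisfying thriller", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2): one score per class
```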
See 1 more Smart Citation
“…To overcome the above shortcomings, a novel structured pruning paradigm was introduced [78], combining low-rank factorization, which retained the dense matrix structure, with an ℓ0-norm objective, which relaxed the constraints enforced by structured pruning. The weight matrices were factorized into a product of two smaller matrices with a diagonal mask that was pruned during training via an ℓ0 regularizer controlling the end sparsity of the model.…”
Section: VI-B2-a Structured Pruning
confidence: 99%
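Tying this excerpt to the Hard-Concrete machinery above, a hedged sketch of the sparsity penalty might look as follows (a minimal illustration under the stretched Hard-Concrete parameterization, not the cited implementation): the expected number of surviving diagonal-mask entries is differentiable in log_alpha and is simply added to the task loss.

```python
import math
import torch

def expected_l0(log_alpha: torch.Tensor,
                beta: float = 2.0 / 3.0,
                gamma: float = -0.1,
                zeta: float = 1.1) -> torch.Tensor:
    """Expected number of nonzero gates: sum_i P(z_i > 0), where
    P(z_i > 0) = sigmoid(log_alpha_i - beta * log(-gamma / zeta))."""
    return torch.sigmoid(log_alpha - beta * math.log(-gamma / zeta)).sum()

# Illustrative objective: the weight lam on the expected-l0 term steers the
# end sparsity of the diagonal masks (hypothetical names).
# loss = task_loss + lam * sum(expected_l0(m.log_alpha) for m in masked_layers)
```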