Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.496

Structured Pruning of Large Language Models

Abstract: Large language models have recently achieved state of the art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly, and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a generic, structured pruning approach by parameterizing each weight matrix using its low-rank factorization, and adaptively removing rank-1 components during training.
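The factorized parameterization described in the abstract can be illustrated with a short sketch. A minimal, hypothetical PyTorch version (module and variable names are mine, not the paper's code): each weight matrix W is reparameterized as P · diag(g) · Q, so removing a rank-1 component of W amounts to zeroing one entry of the diagonal mask g.

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Illustrative low-rank reparameterization W ≈ P @ diag(g) @ Q.

    Zeroing g[i] removes the i-th rank-1 component P[:, i] Q[i, :] while both
    factors stay dense; in the paper the mask is learned with an l0 penalty,
    whereas here g is a plain parameter (a sketch, not the authors' code).
    """

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.P = nn.Parameter(torch.randn(out_features, rank) * 0.02)
        self.Q = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.g = nn.Parameter(torch.ones(rank))  # diagonal mask over rank-1 components

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ((x Q^T) * g) P^T equals x (P diag(g) Q)^T, without materializing W.
        return ((x @ self.Q.t()) * self.g) @ self.P.t()
```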

Cited by 82 publications (68 citation statements) · References 48 publications (41 reference statements)
“…Another class of approaches carefully selects weights to reduce model size. Lan et al. (2020) use low-rank factorization to reduce the size of the embedding matrices, while Wang et al. (2019f) factorize other weight matrices. Additionally, parameters can be shared between layers (Dehghani et al., 2019; Lan et al., 2020) or between an encoder and decoder (Raffel et al., 2019).…”
Section: Inference
confidence: 99%
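The embedding factorization credited here to Lan et al. (2020, ALBERT) is easy to sketch. A minimal, hypothetical PyTorch illustration (names are mine): a V × d embedding table becomes a V × r lookup followed by an r × d projection, shrinking V·d parameters to V·r + r·d when r ≪ d.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Illustrative ALBERT-style embedding factorization (not library code).

    With vocab_size=30_000, hidden_dim=1024, r=128, roughly 30.7M embedding
    parameters shrink to about 4.0M (30_000*128 + 128*1024).
    """

    def __init__(self, vocab_size: int, hidden_dim: int, r: int):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, r)            # V x r table
        self.project = nn.Linear(r, hidden_dim, bias=False)  # r x d projection

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.project(self.lookup(token_ids))
```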
“…This objective is still complicated by the discrete nature of the z_τ's, but the expectation provides some guidance for empirically effective relaxations. We follow prior work (Louizos et al., 2018; Wang et al., 2019b) and relax z_τ into the continuous space [0, 1]^d with a stretched Hard-Concrete distribution (Jang et al., 2017; Maddison et al., 2017), which allows for the use of pathwise gradient estimators. Specifically, z_τ is now defined to be a deterministic and (sub)differentiable function of a sample u from a uniform distribution,…”
Section: Differentiable Approximation To The…
confidence: 99%
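The stretched Hard-Concrete relaxation quoted above can be written out in a few lines. This is a hedged sketch following Louizos et al. (2018); the parameter names (log_alpha, beta, gamma, zeta) are the conventional ones and are assumptions, not taken from the quoted paper's code.

```python
import torch

def sample_hard_concrete(log_alpha: torch.Tensor,
                         beta: float = 2.0 / 3.0,
                         gamma: float = -0.1,
                         zeta: float = 1.1) -> torch.Tensor:
    """Draw a relaxed gate z in [0, 1] as a differentiable function of u ~ U(0, 1).

    A binary-Concrete sample is stretched to (gamma, zeta) and clamped to
    [0, 1], so the endpoints carry exact 0/1 probability mass while gradients
    reach log_alpha through the pathwise (reparameterization) estimator.
    """
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)  # avoid log(0)
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / beta)
    return torch.clamp(s * (zeta - gamma) + gamma, min=0.0, max=1.0)
```

Under this distribution the probability that a gate is nonzero has the closed form sigmoid(log_alpha − beta · log(−gamma/zeta)), which is what makes an expected-ℓ0 penalty differentiable.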
“…For evaluation we use the GLUE benchmark (Wang et al., 2019b) as well as the SQuAD extractive question answering dataset (Rajpurkar et al., 2016). We use the pretrained BERT model of Devlin et al. (2019) to compare against the adapter-based approach of Houlsby et al. (2019). Our implementation is based on the Hugging Face Transformers library.…”
Section: Model and Datasets
confidence: 99%
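For readers unfamiliar with the toolchain this excerpt names, a minimal Hugging Face Transformers setup for a two-class GLUE-style task looks roughly like the following; the checkpoint and label count are illustrative, not the quoted paper's exact configuration.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative: a BERT checkpoint ready for fine-tuning on e.g. SST-2.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("a tightly paced, satisfying thriller", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2): one score per class
```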
See 1 more Smart Citation
“…To overcome the above shortcomings, a novel structured pruning paradigm was introduced [78], combining low-rank factorization, which retained the dense matrix structure, with an ℓ0-norm objective, which relaxed the constraints enforced by structured pruning. The weight matrices were factorized into a product of two smaller matrices with a diagonal mask that was pruned during training via an ℓ0 regularizer controlling the end sparsity of the model.…”
Section: VI-B2-a Structured Pruning
confidence: 99%
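Tying this excerpt to the Hard-Concrete machinery above, a hedged sketch of the sparsity penalty might look as follows (a minimal illustration under the stretched Hard-Concrete parameterization, not the cited implementation): the expected number of surviving diagonal-mask entries is differentiable in log_alpha and is simply added to the task loss.

```python
import math
import torch

def expected_l0(log_alpha: torch.Tensor,
                beta: float = 2.0 / 3.0,
                gamma: float = -0.1,
                zeta: float = 1.1) -> torch.Tensor:
    """Expected number of nonzero gates: sum_i P(z_i > 0), where
    P(z_i > 0) = sigmoid(log_alpha_i - beta * log(-gamma / zeta))."""
    return torch.sigmoid(log_alpha - beta * math.log(-gamma / zeta)).sum()

# Illustrative objective: the weight lam on the expected-l0 term steers the
# end sparsity of the diagonal masks (hypothetical names).
# loss = task_loss + lam * sum(expected_l0(m.log_alpha) for m in masked_layers)
```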