2021
DOI: 10.48550/arxiv.2111.05754
Preprint

Prune Once for All: Sparse Pre-Trained Language Models

Cited by 11 publications (12 citation statements)
References 0 publications

“…Chen et al (2020a) show a 70%-sparsity model retains the MLM accuracy produced by iterative magnitude pruning. Zafrir et al (2021) show the potential advantage of upstream unstructured pruning against downstream pruning. We consider applying CoFi for upstream pruning as a promising future direction to produce task-agnostic models with flexible structures.…”
Section: Related Work
confidence: 97%
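The upstream pruning referenced in this statement builds on iterative magnitude pruning: the smallest-magnitude weights are zeroed out, the model is trained further, and the process repeats until a target sparsity is reached. Below is a minimal sketch of one pruning step in PyTorch; the function name, layer, and sparsity schedule are illustrative assumptions, not the Prune OFA or CoFi implementation.

```python
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float) -> torch.Tensor:
    """Zero the smallest-magnitude weights of a linear layer.

    Returns the binary mask so pruned weights can be kept at zero
    during subsequent training. Illustrative sketch only.
    """
    weight = layer.weight.data
    k = int(sparsity * weight.numel())              # how many weights to drop
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()       # 1 = keep, 0 = prune
    weight.mul_(mask)
    return mask

# Iterative schedule: raise sparsity gradually with training in between,
# e.g. 50% -> 60% -> 70%, as in typical iterative magnitude pruning.
layer = nn.Linear(768, 768)
for target in (0.5, 0.6, 0.7):
    mask = magnitude_prune(layer, target)
    # ... continue (pre-)training here, masking gradients with `mask` ...
```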
“…Second, we implement a multi-segment variant of late interaction [19] when adapting M6-Rec to tasks that require low-latency real-time inference, where most of the heavy computation is pre-computed offline and cached. Finally, to make M6-Rec deployable on edge devices such as mobile phones, we further employ techniques such as parameter sharing [20], pruning [58], quantization [57], and early-exiting [16,48] to reduce the model size and accelerate the inference speed. In summary, our main contributions are:…”
Section: Attention Mask
confidence: 99%
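For context, late interaction defers the query-item interaction to a cheap final scoring step, so per-item token embeddings can be pre-computed offline and cached, as the statement above describes. The sketch below shows a ColBERT-style MaxSim scoring step under that assumption; the tensor shapes and the cache are placeholders, not the M6-Rec code.

```python
import torch

def late_interaction_score(query_emb: torch.Tensor,
                           cached_item_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: heavy encoding happens offline, scoring is cheap.

    query_emb:       (query_tokens, dim), encoded online per request
    cached_item_emb: (item_tokens, dim), pre-computed offline and cached
    """
    sim = query_emb @ cached_item_emb.T    # token-level similarity matrix
    return sim.max(dim=1).values.sum()     # max over item tokens, sum over query tokens

# Offline: encode items once and cache the results (random placeholders here).
item_cache = {item_id: torch.randn(64, 128) for item_id in range(100)}

# Online: encode the query once, then score every cached candidate cheaply.
query_emb = torch.randn(16, 128)
scores = {i: late_interaction_score(query_emb, emb) for i, emb in item_cache.items()}
```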
“…Reducing the model size keeps hardware costs down and is mandated for resource-limited edge devices. Many strategies have been explored, e.g., parameter sharing [20], distillation [17,41,47,50], pruning [5,12,58], and quantization [57]. Still, the existing tiny language models usually have over 10M parameters, while we estimate that it needs to be around 2M to avoid degrading the user experience when deploying a model to our users' mobile phones.…”
Section: Efficient Language Foundations
confidence: 99%
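Among the size-reduction strategies listed in this statement, post-training quantization is the most mechanical to apply. The sketch below uses PyTorch dynamic quantization on an example BERT checkpoint; the checkpoint name and the size comparison are illustrative, and actual savings depend on the model.

```python
import io
import torch
from transformers import AutoModel

# Post-training dynamic quantization: nn.Linear weights are stored in int8
# and dequantized on the fly at inference time.
model = AutoModel.from_pretrained("bert-base-uncased")   # example checkpoint
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def serialized_mb(m: torch.nn.Module) -> float:
    """Size of the serialized state dict, a rough proxy for on-disk footprint."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32 model: {serialized_mb(model):.0f} MB")
print(f"int8 linear layers: {serialized_mb(quantized):.0f} MB")
```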
“…Different architectures have been explored in this respect, choosing to use an extractive open-QA model (the answers come strictly from the context), Intel/bert-large-uncased-squadv1.1sparse-80-1x4-block-pruneofa [41], for the experiments (with an F1 score of 91.174 on SQuAD v1.1). Some significant tests have been carried out on this model to validate the possibilities of this new interaction.…”
Section: NLP Using Transformers With Questions and Answers
confidence: 99%
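The checkpoint named in this statement is distributed through the Hugging Face Hub, so the extractive question-answering setup it describes can be reproduced with the standard transformers pipeline. A minimal sketch follows; the model identifier is copied verbatim from the citation (the exact Hub name may be hyphenated differently), and the question and context are illustrative.

```python
from transformers import pipeline

# Extractive QA with the 80%-sparse Prune OFA BERT-large checkpoint cited above.
# Identifier copied from the citation; verify the exact Hub name before running.
qa = pipeline(
    "question-answering",
    model="Intel/bert-large-uncased-squadv1.1sparse-80-1x4-block-pruneofa",
)

result = qa(
    question="What does Prune Once for All produce?",   # illustrative question
    context=(
        "Prune Once for All produces sparse pre-trained language models that "
        "can be fine-tuned on downstream tasks while keeping the sparsity "
        "pattern fixed."
    ),
)
print(result["answer"], result["score"])
```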