Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)
DOI: 10.18653/v1/2021.acl-long.510

Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization

Abstract: The Lottery Ticket Hypothesis suggests that an over-parametrized network consists of "lottery tickets", and training a certain collection of them (i.e., a subnetwork) can match the performance of the full model. In this paper, we study such a collection of tickets, which is referred to as "winning tickets", in extremely over-parametrized models, e.g., pre-trained language models. We observe that at certain compression ratios, the generalization performance of the winning tickets can not only match but also exceed that of the full model.
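For readers unfamiliar with the "ticket" terminology, a candidate subnetwork is usually obtained by masking out low-magnitude weights of a trained model and then re-training only the survivors. The sketch below is a generic, illustrative PyTorch version of one-shot global magnitude masking; it is not the authors' implementation (the paper prunes structured components such as attention heads and feed-forward layers), and model and keep_ratio are placeholders.

```python
import torch

def magnitude_mask(model, keep_ratio=0.5):
    """Global magnitude pruning: keep the largest-|w| fraction of all weights."""
    scores = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores, k, largest=True).values.min()
    # 1 = weight survives (part of the candidate ticket), 0 = pruned
    return {name: (p.detach().abs() >= threshold).float()
            for name, p in model.named_parameters()}

def apply_mask(model, mask):
    """Zero out pruned weights in place; fine-tuning the result trains the ticket."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(mask[name])
```

Applying the mask and then fine-tuning the unmasked weights corresponds to training one candidate ticket at the chosen compression ratio.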

Cited by 21 publications (25 citation statements). References 36 publications.

Citation statements (ordered by relevance):
“…Michel et al. (2019) propose a simple gradient-based importance score to prune attention heads. Prasanna et al. (2020); Liang et al. (2021) extend this to prune other components like the feed-forward network of the Transformer (Vaswani et al., 2017). Wang et al. (2020c) decompose the pre-trained model weights and apply L0 regularization (Louizos et al., 2018) to regulate the ranks of the decomposed weights.…”
Section: Related Work (mentioning); confidence: 99%
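For context on the gradient-based importance score quoted above, a common formulation (following Michel et al., 2019) rates each attention head by the absolute gradient of the task loss with respect to a per-head gate fixed at one. The sketch below assumes a HuggingFace-style classification model whose forward pass accepts a head_mask argument and a dataloader yielding input_ids, attention_mask, and labels; it is an illustration, not the code used in any of the cited papers.

```python
import torch

def head_importance(model, dataloader, device="cpu"):
    """Taylor-style head importance in the spirit of Michel et al. (2019):
    score each attention head by |dL/dgate| for a per-head gate of ones
    injected into the forward pass via the model's head_mask argument."""
    cfg = model.config
    head_mask = torch.ones(cfg.num_hidden_layers, cfg.num_attention_heads,
                           device=device, requires_grad=True)
    importance = torch.zeros(cfg.num_hidden_layers, cfg.num_attention_heads,
                             device=device)
    model.to(device).eval()
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch, head_mask=head_mask).loss
        loss.backward()
        importance += head_mask.grad.abs().detach()  # accumulate |dL/dgate|
        head_mask.grad = None
        model.zero_grad()
    return importance  # low-scoring heads are candidates for structured pruning
```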
“…Large-scale pre-trained monolingual language models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have shown promising results in various NLP tasks while suffering from their large model size and high latency. Structured pruning has proven to be an effective approach to compressing and accelerating these large monolingual language models (Michel et al., 2019; Wang et al., 2020c; Prasanna et al., 2020; Liang et al., 2021), making them practical for real-world applications.…”
Section: Introduction (mentioning); confidence: 99%
“…You et al. (2020) draw early-bird tickets (prune the original network) at an early stage of training, and only train the subnetwork from then on. Some recent works extend the LTH from random initialization to pre-trained initialization (Prasanna et al., 2020; Liang et al., 2021; Chen et al., 2021b). Particularly, find that WTs, i.e., subnetworks of the pre-trained BERT, derived from the pre-training task of MLM using IMP are universally transferable to the downstream tasks.…”
Section: The Lottery Ticket Hypothesis (mentioning); confidence: 99%
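The IMP (iterative magnitude pruning) procedure referenced in this excerpt alternates fine-tuning, pruning the smallest surviving weights, and rewinding the survivors to the pre-trained initialization. The following is a minimal sketch under those assumptions; train_fn, prune_fraction, and rounds are hypothetical placeholders, and a faithful implementation would also keep pruned weights frozen at zero during training (e.g., by masking gradients).

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, prune_fraction=0.2, rounds=5):
    """Sketch of IMP with rewinding to the pre-trained weights:
    repeatedly (i) fine-tune, (ii) prune the smallest surviving weights,
    (iii) rewind the survivors to their pre-trained values.
    train_fn(model) is assumed to fine-tune the masked model in place."""
    init_state = copy.deepcopy(model.state_dict())             # pre-trained init
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model)
        # global magnitude threshold over the still-unpruned weights
        alive = torch.cat([p.detach()[masks[n].bool()].abs().flatten()
                           for n, p in model.named_parameters()])
        k = max(1, int(prune_fraction * alive.numel()))
        threshold = torch.kthvalue(alive, k).values
        for n, p in model.named_parameters():
            masks[n] *= (p.detach().abs() > threshold).float()
        # rewind: reset surviving weights to the pre-trained initialization
        model.load_state_dict(init_state)
        with torch.no_grad():
            for n, p in model.named_parameters():
                p.mul_(masks[n])
    return masks  # the final mask defines the candidate winning-ticket subnetwork
```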