Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)
DOI: 10.18653/v1/2021.acl-long.510

Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization

Abstract: The Lottery Ticket Hypothesis suggests that an over-parametrized network consists of "lottery tickets", and training a certain collection of them (i.e., a subnetwork) can match the performance of the full model. In this paper, we study such a collection of tickets, which is referred to as "winning tickets", in extremely over-parametrized models, e.g., pre-trained language models. We observe that at certain compression ratios, the generalization performance of the winning tickets can not only match but also exceed that of the full model.
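For readers unfamiliar with the "ticket" terminology, a candidate subnetwork is usually obtained by masking out low-magnitude weights of a trained model and then re-training only the survivors. The sketch below is a generic, illustrative PyTorch version of one-shot global magnitude masking; it is not the authors' implementation (the paper prunes structured components such as attention heads and feed-forward layers), and model and keep_ratio are placeholders.

```python
import torch

def magnitude_mask(model, keep_ratio=0.5):
    """Global magnitude pruning: keep the largest-|w| fraction of all weights."""
    scores = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores, k, largest=True).values.min()
    # 1 = weight survives (part of the candidate ticket), 0 = pruned
    return {name: (p.detach().abs() >= threshold).float()
            for name, p in model.named_parameters()}

def apply_mask(model, mask):
    """Zero out pruned weights in place; fine-tuning the result trains the ticket."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(mask[name])
```

Applying the mask and then fine-tuning the unmasked weights corresponds to training one candidate ticket at the chosen compression ratio.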

Cited by 21 publications (25 citation statements). References 36 publications.

Citation statements (ordered by relevance):
“…Michel et al. (2019) propose a simple gradient-based importance score to prune attention heads. Prasanna et al. (2020); Liang et al. (2021) extend this to prune other components like the feed-forward network of the Transformer (Vaswani et al., 2017). Wang et al. (2020c) decompose the pre-trained model weights and apply L0 regularization (Louizos et al., 2018) to regulate the ranks of the decomposed weights.…”
Section: Related Work (mentioning); confidence: 99%
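For context on the gradient-based importance score quoted above, a common formulation (following Michel et al., 2019) rates each attention head by the absolute gradient of the task loss with respect to a per-head gate fixed at one. The sketch below assumes a HuggingFace-style classification model whose forward pass accepts a head_mask argument and a dataloader yielding input_ids, attention_mask, and labels; it is an illustration, not the code used in any of the cited papers.

```python
import torch

def head_importance(model, dataloader, device="cpu"):
    """Taylor-style head importance in the spirit of Michel et al. (2019):
    score each attention head by |dL/dgate| for a per-head gate of ones
    injected into the forward pass via the model's head_mask argument."""
    cfg = model.config
    head_mask = torch.ones(cfg.num_hidden_layers, cfg.num_attention_heads,
                           device=device, requires_grad=True)
    importance = torch.zeros(cfg.num_hidden_layers, cfg.num_attention_heads,
                             device=device)
    model.to(device).eval()
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch, head_mask=head_mask).loss
        loss.backward()
        importance += head_mask.grad.abs().detach()  # accumulate |dL/dgate|
        head_mask.grad = None
        model.zero_grad()
    return importance  # low-scoring heads are candidates for structured pruning
```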
“…Large-scale pre-trained monolingual language models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have shown promising results in various NLP tasks while suffering from their large model size and high latency. Structured pruning has proven to be an effective approach to compressing and accelerating these large monolingual language models (Michel et al., 2019; Wang et al., 2020c; Prasanna et al., 2020; Liang et al., 2021), making them practical for real-world applications.…”
Section: Introduction (mentioning); confidence: 99%
“…You et al. (2020) draw early-bird tickets (prune the original network) at an early stage of training, and only train the subnetwork from then on. Some recent works extend the LTH from random initialization to pre-trained initialization (Prasanna et al., 2020; Liang et al., 2021; Chen et al., 2021b). Particularly, find that WTs, i.e., subnetworks of the pre-trained BERT, derived from the pre-training task of MLM using IMP are universally transferable to the downstream tasks.…”
Section: The Lottery Ticket Hypothesis (mentioning); confidence: 99%
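The IMP (iterative magnitude pruning) procedure referenced in this excerpt alternates fine-tuning, pruning the smallest surviving weights, and rewinding the survivors to the pre-trained initialization. The following is a minimal sketch under those assumptions; train_fn, prune_fraction, and rounds are hypothetical placeholders, and a faithful implementation would also keep pruned weights frozen at zero during training (e.g., by masking gradients).

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, prune_fraction=0.2, rounds=5):
    """Sketch of IMP with rewinding to the pre-trained weights:
    repeatedly (i) fine-tune, (ii) prune the smallest surviving weights,
    (iii) rewind the survivors to their pre-trained values.
    train_fn(model) is assumed to fine-tune the masked model in place."""
    init_state = copy.deepcopy(model.state_dict())             # pre-trained init
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model)
        # global magnitude threshold over the still-unpruned weights
        alive = torch.cat([p.detach()[masks[n].bool()].abs().flatten()
                           for n, p in model.named_parameters()])
        k = max(1, int(prune_fraction * alive.numel()))
        threshold = torch.kthvalue(alive, k).values
        for n, p in model.named_parameters():
            masks[n] *= (p.detach().abs() > threshold).float()
        # rewind: reset surviving weights to the pre-trained initialization
        model.load_state_dict(init_state)
        with torch.no_grad():
            for n, p in model.named_parameters():
                p.mul_(masks[n])
    return masks  # the final mask defines the candidate winning-ticket subnetwork
```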