Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.360
Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture

Abstract: Sparse models require less memory for storage and enable faster inference by reducing the necessary number of FLOPs. This is relevant both for time-critical and on-device computations using neural networks. The stabilized lottery ticket hypothesis states that networks can be pruned after none or few training iterations, using a mask computed based on the unpruned converged model. On the transformer architecture and the WMT 2014 English→German and English→French tasks, we show that stabilized lottery ticket p…
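The abstract's one-line description of the stabilized lottery ticket procedure (prune at or near initialization using a mask derived from the converged model) can be made concrete with a minimal PyTorch sketch. This is not the authors' implementation: the helper names magnitude_mask and apply_mask are hypothetical, global magnitude pruning is assumed as the mask criterion, and the tiny nn.Transformer instances merely stand in for the converged WMT model and its early (rewound) checkpoint.

```python
import torch
import torch.nn as nn

def magnitude_mask(model: nn.Module, sparsity: float) -> dict:
    """Binary mask keeping the largest-magnitude weights of the converged model.

    The threshold is chosen globally over all weight matrices (biases and
    layer-norm parameters are left dense); `sparsity` is the fraction removed.
    """
    all_weights = torch.cat([p.detach().abs().flatten()
                             for _, p in model.named_parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, sparsity)
    return {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters() if p.dim() > 1}

def apply_mask(model: nn.Module, mask: dict) -> None:
    """Zero out the pruned weights in place; the mask stays fixed during retraining."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in mask:
                p.mul_(mask[name])

# Hypothetical usage: the mask is computed on the fully trained model but applied
# to a checkpoint from iteration 0 (or a few thousand iterations in), which is then
# retrained with the mask held fixed -- the "rewinding" step of the stabilized LTH.
converged = nn.Transformer(d_model=64, nhead=4)   # stands in for the trained MT model
rewound = nn.Transformer(d_model=64, nhead=4)     # stands in for the early checkpoint
mask = magnitude_mask(converged, sparsity=0.8)    # e.g. remove 80% of the weights
apply_mask(rewound, mask)                         # retrain `rewound` under this mask
```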

Cited by 23 publications (16 citation statements). References 9 publications.
“…In this paper, we concentrate on a specific case of block sparsity that removes entire attention heads from a model without masking. Brix et al. (2020) applied the lottery ticket hypothesis and other techniques to prune individual coefficients from a transformer for machine translation. In their experiments, a stabilised version of lottery ticket pruning damages translation quality by 2 BLEU points while removing 80% of all parameters.…”
Section: Related Work
confidence: 99%
“…of such a collection of tickets, which is usually referred to as "winning tickets", indicates the potential of training a smaller network to achieve the full model's performance. LTH has been widely explored across various fields of deep learning (Frankle et al., 2019; You et al., 2019; Brix et al., 2020; Movva and Zhao, 2020; Girish et al., 2020).…”
Section: Introduction
confidence: 99%
“…Pruning neural networks The literature on pruning neural networks is decades old (Mozer and Smolensky, 1989; Cun et al., 1990; Hassibi and Stork, 1993), but has recently seen a resurgence with the all-encompassing success of neural networks and the need for small and fast on-device model inference (Han et al., 2015; Sze et al., 2017; Frankle and Carbin, 2018; Frankle et al., 2019). In NLP, specifically, pruning methods have been applied to recurrent neural networks (Desai et al., 2019; Yu et al., 2020), as well as transformers (Brix et al., 2020; Prasanna et al., 2020; Chen et al., 2020; Sanh et al., 2020).…”
Section: Related Work
confidence: 99%