Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.360
Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture

Abstract: Sparse models require less memory for storage and enable faster inference by reducing the necessary number of FLOPs. This is relevant both for time-critical and on-device computations using neural networks. The stabilized lottery ticket hypothesis states that networks can be pruned after none or few training iterations, using a mask computed based on the unpruned converged model. On the transformer architecture and the WMT 2014 English→German and English→French tasks, we show that stabilized lottery ticket p…
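The abstract's one-line description of the stabilized lottery ticket procedure (prune at or near initialization using a mask derived from the converged model) can be made concrete with a minimal PyTorch sketch. This is not the authors' implementation: the helper names magnitude_mask and apply_mask are hypothetical, global magnitude pruning is assumed as the mask criterion, and the tiny nn.Transformer instances merely stand in for the converged WMT model and its early (rewound) checkpoint.

```python
import torch
import torch.nn as nn

def magnitude_mask(model: nn.Module, sparsity: float) -> dict:
    """Binary mask keeping the largest-magnitude weights of the converged model.

    The threshold is chosen globally over all weight matrices (biases and
    layer-norm parameters are left dense); `sparsity` is the fraction removed.
    """
    all_weights = torch.cat([p.detach().abs().flatten()
                             for _, p in model.named_parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, sparsity)
    return {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters() if p.dim() > 1}

def apply_mask(model: nn.Module, mask: dict) -> None:
    """Zero out the pruned weights in place; the mask stays fixed during retraining."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in mask:
                p.mul_(mask[name])

# Hypothetical usage: the mask is computed on the fully trained model but applied
# to a checkpoint from iteration 0 (or a few thousand iterations in), which is then
# retrained with the mask held fixed -- the "rewinding" step of the stabilized LTH.
converged = nn.Transformer(d_model=64, nhead=4)   # stands in for the trained MT model
rewound = nn.Transformer(d_model=64, nhead=4)     # stands in for the early checkpoint
mask = magnitude_mask(converged, sparsity=0.8)    # e.g. remove 80% of the weights
apply_mask(rewound, mask)                         # retrain `rewound` under this mask
```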

Cited by 23 publications (16 citation statements). References 9 publications.
“…In this paper, we concentrate on a specific case of block sparsity that removes entire attention heads from a model without masking. Brix et al. (2020) applied the lottery ticket hypothesis and other techniques to prune individual coefficients from a transformer for machine translation. In their experiments, a stabilised version of lottery ticket pruning damages translation quality by 2 BLEU points while removing 80% of all parameters.…”
Section: Related Work
confidence: 99%
“…of such a collection of tickets, which is usually referred to as "winning tickets", indicates the potential of training a smaller network to achieve the full model's performance. LTH has been widely explored across various fields of deep learning (Frankle et al., 2019; You et al., 2019; Brix et al., 2020; Movva and Zhao, 2020; Girish et al., 2020).…”
Section: Introduction
confidence: 99%
“…Pruning neural networks The literature on pruning neural networks is decades old (Mozer and Smolensky, 1989; Cun et al., 1990; Hassibi and Stork, 1993), but has recently seen a resurgence with the all-encompassing success of neural networks and the need for small and fast on-device model inference (Han et al., 2015; Sze et al., 2017; Frankle and Carbin, 2018; Frankle et al., 2019). In NLP, specifically, pruning methods have been applied to recurrent neural networks (Desai et al., 2019; Yu et al., 2020), as well as transformers (Brix et al., 2020; Prasanna et al., 2020; Chen et al., 2020; Sanh et al., 2020).…”
Section: Related Work
confidence: 99%