Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.211

Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation

Abstract: The attention mechanism is the crucial component of the transformer architecture. Recent research shows that most attention heads are not confident in their decisions and can be pruned after training. However, removing them before training a model results in lower quality. In this paper, we apply the lottery ticket hypothesis to prune heads in the early stages of training, instead of doing so on a fully converged model. Our experiments on machine translation show that it is possible to remove up to three-quarters…
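To make the idea in the abstract concrete, below is a minimal sketch (not the authors' implementation) of lottery-ticket-style early head pruning: after a short warm-up, heads are ranked by a simple confidence proxy and all but the top fraction are masked for the rest of training. The confidence proxy and the keep ratio are illustrative assumptions.

```python
import torch

def head_keep_mask(attn_weights: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Build a {0,1} keep-mask over attention heads from warm-up statistics.

    attn_weights: [num_heads, tgt_len, src_len] attention probabilities averaged
    over a few warm-up batches. The 'confidence' proxy used here is how peaked
    each head's distribution is (mean max attention weight); this is an
    assumption for illustration, not the paper's exact criterion.
    """
    num_heads = attn_weights.size(0)
    confidence = attn_weights.max(dim=-1).values.mean(dim=-1)   # [num_heads]
    k = max(1, int(round(keep_ratio * num_heads)))
    keep = torch.zeros(num_heads)
    keep[confidence.topk(k).indices] = 1.0
    return keep  # multiply each head's output by its mask entry and continue training


# Toy example: 8 heads, keep the most confident quarter after warm-up.
warmup_attn = torch.softmax(torch.randn(8, 10, 12), dim=-1)
print(head_keep_mask(warmup_attn, keep_ratio=0.25))
```

The point of doing this early rather than after convergence is that the surviving heads still get most of the training budget, in line with the lottery ticket view of pruning.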

Cited by 39 publications (28 citation statements) · References 19 publications
“…In addition, we conduct a one-shot pruning for computational simplicity. We leave other importance measures and pruning schedules, which may help identify better generalized super tickets, for future work (Voita et al., 2019; Behnke and Heafield, 2020; Fan et al., 2019; Zhou et al., 2020; Sajjad et al., 2020). Searching Super Tickets Efficiently.…”
Section: Discussion (mentioning)
confidence: 99%
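For readers unfamiliar with the terminology in the quote above: one-shot pruning removes a fixed fraction of heads in a single step from an importance score, rather than over an iterative schedule. A minimal sketch, assuming per-head importance scores are already available (e.g., gradient-based sensitivity in the style of Michel et al., 2019); the function and variable names are illustrative.

```python
import torch

def one_shot_prune(importance: torch.Tensor, prune_fraction: float) -> torch.Tensor:
    """Return a {0,1} mask that removes the least-important heads in a single step.

    importance: [num_layers, num_heads] scores, higher = more important
    (how the scores are obtained is left to the caller).
    """
    flat = importance.flatten()
    n_prune = int(prune_fraction * flat.numel())
    mask = torch.ones_like(flat)
    if n_prune > 0:
        # Indices of the globally least important heads across all layers.
        mask[flat.topk(n_prune, largest=False).indices] = 0.0
    return mask.view_as(importance)


# Toy example: 6 layers x 8 heads, prune 75% of the heads in one shot.
scores = torch.rand(6, 8)
mask = one_shot_prune(scores, prune_fraction=0.75)
print(int(mask.sum().item()), "of", mask.numel(), "heads kept")
```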
“…If this becomes possible we would be able to quantify the degree to which the world's writing systems have become on balance less logographic over time, an interesting computational twist on Gelb's original intuition. [35] For a different approach to this issue see Beinborn, Zesch, and Gurevych (2016), who train a model to predict spelling difficulty, based on corpora of spelling errors in three languages. [36] We note in passing that such burden of proof of broader interest is inconsistently applied across areas of computational linguistics.…”
Section: Discussion (mentioning)
confidence: 99%
“…One difficulty that naturally arises in the transformer setting is how to select the appropriate representation of attention weights given multiple self-attention heads. There has been an increased research focus on analyzing the behavior of attention mechanisms in various flavors of transformer models in order to understand the linguistic function of the attention and also improve model compression schemes (Clark et al. 2019; Michel, Levy, and Neubig 2019; Vig and Belinkov 2019; Voita et al. 2019; Behnke and Heafield 2020; Wang et al. 2020; Rogers, Kovaleva, and Rumshisky 2021). While in-depth investigation into the precise role the multiple attention heads play for logography is outside the scope of this work, we opt for a simple strategy whereby we inspect multiple attention heads in the top layer of the decoder-encoder attention block.…”
Section: Investigation of Alternative Neural Attention Architectures (mentioning)
confidence: 99%
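As a rough illustration of the strategy described above (inspecting the heads of the top decoder-encoder attention block), the sketch below reads cross-attention weights from a Hugging Face seq2seq translation model. The checkpoint name, the example sentence pair, and the mean-over-heads aggregation are assumptions for the example, not the cited paper's setup; a recent transformers version is assumed.

```python
# Requires: pip install torch transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-de"   # example checkpoint; any seq2seq model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

src = tokenizer("The attention mechanism is crucial.", return_tensors="pt")
tgt = tokenizer(text_target="Der Aufmerksamkeitsmechanismus ist entscheidend.",
                return_tensors="pt")

with torch.no_grad():
    out = model(input_ids=src.input_ids, attention_mask=src.attention_mask,
                labels=tgt.input_ids, output_attentions=True)

# cross_attentions: one tensor per decoder layer, [batch, num_heads, tgt_len, src_len].
top_layer = out.cross_attentions[-1][0]     # top layer, first (only) batch element
print("per-head shape:", top_layer.shape)   # [num_heads, tgt_len, src_len]
print("mean over heads:\n", top_layer.mean(dim=0))
```

Averaging over heads is only one option; the quote's point is precisely that different heads may need to be inspected individually.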
“…Recent studies have analyzed the roles of attention heads in Transformer models, either in language modeling (LM) (Michel et al., 2019; Clark et al., 2019; Jo and Myaeng, 2020) or in NMT (Voita et al., 2019; Behnke and Heafield, 2020; Michel et al., 2019). It has been shown that a set of attention heads might be redundant at inference and can be pruned with almost no loss in performance.…”
Section: Related Work (mentioning)
confidence: 99%
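A small sketch of the redundancy test this last quote alludes to: ablate heads one at a time at inference and keep them pruned whenever the quality metric barely moves. The mask handling and the `evaluate` callable are placeholders; a real setup would score BLEU or a similar metric on a held-out set.

```python
import torch

def find_redundant_heads(head_mask: torch.Tensor, evaluate, baseline: float,
                         tolerance: float = 0.1):
    """Greedily zero out heads whose removal costs at most `tolerance` metric points.

    head_mask: [num_layers, num_heads] tensor of ones, modified in place.
    evaluate:  callable that runs the model under the current mask and returns a score.
    """
    redundant = []
    for layer in range(head_mask.size(0)):
        for head in range(head_mask.size(1)):
            head_mask[layer, head] = 0.0
            if baseline - evaluate(head_mask) <= tolerance:
                redundant.append((layer, head))       # cheap to remove: keep it pruned
            else:
                head_mask[layer, head] = 1.0          # too costly: restore the head
    return redundant


# Toy usage with a dummy metric that only cares about layer-0 heads.
mask = torch.ones(2, 4)
dummy_eval = lambda m: 30.0 - 5.0 * (4 - m[0].sum().item())
print(find_redundant_heads(mask, dummy_eval, baseline=30.0))
```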