Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.260
On the weak link between importance and prunability of attention heads

Abstract: Given the success of Transformer-based models, two directions of study have emerged: interpreting the role of individual attention heads and down-sizing the models for efficiency. Our work straddles these two streams: we analyse the importance of basing pruning strategies on the interpreted role of the attention heads. We evaluate this on Transformer and BERT models on multiple NLP tasks. Firstly, we find that a large fraction of the attention heads can be randomly pruned with limited effect on accuracy. Secondly, …
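As a rough illustration of what pruning attention heads looks like in practice (this sketch assumes the HuggingFace transformers library and bert-base-uncased; it is not the authors' experimental pipeline, and the chosen heads are arbitrary):

# Minimal sketch: remove a few attention heads from a pretrained BERT and
# check that the model still runs. Assumes HuggingFace `transformers`;
# not the paper's exact pruning setup.
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# {layer index: [head indices to remove in that layer]} -- arbitrary example choice
heads_to_prune = {0: [0, 3], 5: [1], 11: [7, 8]}
model.prune_heads(heads_to_prune)

inputs = tokenizer("Attention heads can often be pruned with little accuracy loss.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # hidden-state shape is unchanged by pruning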

Cited by 9 publications (11 citation statements)
References 13 publications (13 reference statements)
“…However, we find that mBERT is just as robust to pruning as is BERT. At 50% random pruning, average accuracy drop with mBERT on the GLUE tasks is 2% relative to the base performance, similar to results for BERT reported in Budhraja et al. (2020). Further, mBERT has identical preferences amongst layers to BERT, where (i) heads in the middle layers are more important than the ones in top and bottom layers, and (ii) consecutive layers cannot be simultaneously pruned.…”
Section: Introduction (supporting)
confidence: 75%
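To make the "50% random pruning" setting quoted above concrete, here is a small, hypothetical sketch of drawing such a pruning plan for BERT-base (12 layers × 12 heads); the exact sampling protocol used in the cited works may differ.

import random

# BERT-base has 12 layers with 12 attention heads each (144 heads in total).
num_layers, num_heads = 12, 12
all_heads = [(layer, head) for layer in range(num_layers) for head in range(num_heads)]

# Sample 50% of all heads uniformly at random.
random.seed(0)
pruned = random.sample(all_heads, k=len(all_heads) // 2)

# Group into the {layer: [head indices]} map accepted by prune_heads().
heads_to_prune = {}
for layer, head in pruned:
    heads_to_prune.setdefault(layer, []).append(head)
# model.prune_heads(heads_to_prune)  # applied to a loaded BertModel as in the earlier sketch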
“…One set of studies comment on the functional role and importance of attention heads in these models (Clark et al., 2019; Michel et al., 2019; Voita et al., 2019b,a; Liu et al., 2019a; Belinkov et al., 2017). Another set of studies have identified ways to make these models more efficient by methods such as pruning (McCarley, 2019; Gordon et al., 2020; Sajjad et al., 2020; Budhraja et al., 2020). A third set of studies show that multilingual extensions of these models, such as Multilingual BERT (Devlin et al., 2019), have surprisingly high crosslingual transfer (Pires et al., 2019; Wu and Dredze, 2019).…”
Section: Introduction (mentioning)
confidence: 99%
“…This observation is in conformity with Prasanna et al. (2020) and Sajjad et al. (2021). Budhraja et al. (2020) also highlight the importance of middle layers but find no preference between top and bottom layers. For Enc-Dec (Figure 6b), we find that many more encoder-decoder cross-attention heads are retained compared to the other two types of attention (encoder and decoder self-attention).…”
Section: Distribution of Heads (mentioning)
confidence: 92%