Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.260
On the weak link between importance and prunability of attention heads

Abstract: Given the success of Transformer-based models, two directions of study have emerged: interpreting the role of individual attention heads and down-sizing the models for efficiency. Our work straddles these two streams: we analyse the importance of basing pruning strategies on the interpreted role of the attention heads. We evaluate this on Transformer and BERT models on multiple NLP tasks. Firstly, we find that a large fraction of the attention heads can be randomly pruned with limited effect on accuracy. Secondly, …
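As a rough illustration of what pruning attention heads looks like in practice (this sketch assumes the HuggingFace transformers library and bert-base-uncased; it is not the authors' experimental pipeline, and the chosen heads are arbitrary):

# Minimal sketch: remove a few attention heads from a pretrained BERT and
# check that the model still runs. Assumes HuggingFace `transformers`;
# not the paper's exact pruning setup.
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# {layer index: [head indices to remove in that layer]} -- arbitrary example choice
heads_to_prune = {0: [0, 3], 5: [1], 11: [7, 8]}
model.prune_heads(heads_to_prune)

inputs = tokenizer("Attention heads can often be pruned with little accuracy loss.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # hidden-state shape is unchanged by pruning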

Cited by 9 publications (11 citation statements)
References 13 publications (13 reference statements)
“…However, we find that mBERT is just as robust to pruning as is BERT. At 50% random pruning, average accuracy drop with mBERT on the GLUE tasks is 2% relative to the base performance, similar to results for BERT reported in Budhraja et al. (2020). Further, mBERT has identical preferences amongst layers to BERT, where (i) heads in the middle layers are more important than the ones in top and bottom layers, and (ii) consecutive layers cannot be simultaneously pruned.…”
Section: Introduction (supporting)
confidence: 75%
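To make the "50% random pruning" setting quoted above concrete, here is a small, hypothetical sketch of drawing such a pruning plan for BERT-base (12 layers × 12 heads); the exact sampling protocol used in the cited works may differ.

import random

# BERT-base has 12 layers with 12 attention heads each (144 heads in total).
num_layers, num_heads = 12, 12
all_heads = [(layer, head) for layer in range(num_layers) for head in range(num_heads)]

# Sample 50% of all heads uniformly at random.
random.seed(0)
pruned = random.sample(all_heads, k=len(all_heads) // 2)

# Group into the {layer: [head indices]} map accepted by prune_heads().
heads_to_prune = {}
for layer, head in pruned:
    heads_to_prune.setdefault(layer, []).append(head)
# model.prune_heads(heads_to_prune)  # applied to a loaded BertModel as in the earlier sketch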
“…One set of studies comment on the functional role and importance of attention heads in these models (Clark et al., 2019; Michel et al., 2019; Voita et al., 2019b,a; Liu et al., 2019a; Belinkov et al., 2017). Another set of studies have identified ways to make these models more efficient by methods such as pruning (McCarley, 2019; Gordon et al., 2020; Sajjad et al., 2020; Budhraja et al., 2020). A third set of studies show that multilingual extensions of these models, such as Multilingual BERT (Devlin et al., 2019), have surprisingly high crosslingual transfer (Pires et al., 2019; Wu and Dredze, 2019).…”
Section: Introduction (mentioning)
confidence: 99%
“…This observation is in conformity with Prasanna et al. (2020) and Sajjad et al. (2021). Budhraja et al. (2020) also highlight the importance of middle layers but find no preference between top and bottom layers. For Enc-Dec (Figure 6b), we find that many more encoder-decoder cross-attention heads are retained compared to the other two types of attention (encoder and decoder self-attention).…”
Section: Distribution of Heads (mentioning)
confidence: 92%