Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 2021
DOI: 10.18653/v1/2021.eacl-main.9
Telling BERT’s Full Story: from Local Attention to Global Aggregation

Abstract: We take a deep look into the behaviour of self-attention heads in the transformer architecture. In light of recent work discouraging the use of attention distributions for explaining a model's behaviour, we show that attention distributions can nevertheless provide insights into the local behaviour of attention heads. On this basis, we propose a distinction between local patterns revealed by attention and global patterns that refer back to the input, and analyze BERT from both angles. We use gradient attribution to …

Cited by 14 publications (13 citation statements). References 17 publications.
“…Therefore, one important step is to measure the importance of each token. To this end, we opted for saliency scores, which have recently been shown to be a reliable criterion for measuring tokens' contributions (Bastings and Filippova, 2020; Pascual et al., 2021). In Section 5.1 we will show results for a series of quantitative analyses that support this choice.…”
Section: Gradient-based Saliency Scores (mentioning, confidence: 99%)
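The saliency scores referred to here can be illustrated with a small sketch. Below is a minimal example of one common recipe, gradient-norm saliency over input embeddings, assuming PyTorch and Hugging Face transformers; the checkpoint name, example sentence, and the gradient-norm choice are illustrative assumptions, not details taken from the citing paper.

```python
# Minimal sketch of gradient-based token saliency (assumptions: PyTorch,
# Hugging Face transformers, bert-base-uncased; not the cited paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

enc = tokenizer("a thoroughly enjoyable film", return_tensors="pt")

# Look up the word embeddings ourselves so we can take gradients w.r.t. them.
embeddings = model.bert.embeddings.word_embeddings(enc["input_ids"])
embeddings = embeddings.detach().requires_grad_(True)

# inputs_embeds replaces the word-embedding lookup; position and token-type
# embeddings are still added inside the model.
out = model(inputs_embeds=embeddings, attention_mask=enc["attention_mask"])

# Saliency of a token = L2 norm of the gradient of the top logit w.r.t. its embedding.
out.logits[0].max().backward()
saliency = embeddings.grad[0].norm(dim=-1)

for tok, score in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), saliency):
    print(f"{tok:>12}  {score.item():.4f}")
```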
“…While self-attention is one of the most white-box components in transformer-based models, relying on raw attention weights as an explanation can be misleading, since they are not necessarily responsible for determining the contribution of each token to the final classifier's decision (Jain and Wallace, 2019; Serrano and Smith, 2019; Abnar and Zuidema, 2020). This is because raw attention is faithful only to the local mixture of information in each layer and cannot capture a global perspective of the information flow through the entire model (Pascual et al., 2021).…”
Section: Introduction (mentioning, confidence: 99%)
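The "local mixture of information" this statement refers to can be made concrete by looking at what raw attention actually exposes: a per-layer, per-head distribution over the positions of the previous layer. A minimal sketch, assuming Hugging Face transformers and an arbitrary choice of layer and head:

```python
# Sketch of the "local" view: per-layer, per-head attention distributions from BERT.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

enc = tokenizer("the cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# out.attentions: tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
layer, head = 5, 3                       # arbitrary head, for illustration
attn = out.attentions[layer][0, head]    # (seq_len, seq_len), rows sum to 1

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for i, tok in enumerate(tokens):
    j = attn[i].argmax().item()
    print(f"{tok:>8} attends most to {tokens[j]:>8} ({attn[i, j].item():.2f})")
```

Each row sums to 1 over the previous layer's positions, not over the original input tokens, which is exactly why these distributions describe local mixing rather than global input attributions.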
“…While these methods can be employed for single-layer (local) analysis, multi-layer attributions are not necessarily correlated with single-layer attributions, due to the significant degree of information combination across the layers of multi-layer language models (Pascual et al., 2021; Brunner et al., 2020). Various saliency methods exist for explaining the model's decision based on the input (Li et al., 2016; Bastings and Filippova, 2020; Atanasova et al., 2020; Wu and Ong, 2021).…”
Section: Related Work (mentioning, confidence: 99%)
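One way to see why single-layer and multi-layer views diverge is attention rollout (Abnar and Zuidema, 2020, cited above), which aggregates per-layer attention into an approximate input-level map. The sketch below works under rollout's simplifying assumptions (heads averaged, residual connections treated as identity mixing, MLP effects ignored) and reuses the attentions tuple from the previous sketch:

```python
# Sketch of attention rollout: multiply per-layer attention maps (with residual
# connections approximated as the identity) into one input-level map.
import torch

def attention_rollout(attentions):
    rollout = None
    for layer_attn in attentions:               # (batch, heads, seq, seq)
        a = layer_attn.mean(dim=1)              # average over heads
        a = a + torch.eye(a.size(-1))           # residual connection as identity
        a = a / a.sum(dim=-1, keepdim=True)     # rows sum to 1 again
        rollout = a if rollout is None else a @ rollout
    return rollout                              # (batch, seq, seq)

# rollout[0, i, j]: approximate contribution of input token j to position i at the
# top layer, under rollout's simplifying assumptions.
```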
“…Additionally, gradient-based alternatives (Simonyan et al., 2014; Kindermans et al., 2016; Li et al., 2016) have been argued to provide a more robust basis for token attribution analysis (Atanasova et al., 2020; Brunner et al., 2020; Pascual et al., 2021). Nonetheless, gradient-based alternatives have not fully replaced their attention-based counterparts, mainly because of their high computational cost.…”
(mentioning, confidence: 99%)
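The computational cost mentioned here comes from the extra passes that some gradient methods require. Purely as an illustration (the cited works do not necessarily use this method), integrated gradients needs one forward and backward pass per interpolation step, versus a single forward pass to read off attention weights; the zero-embedding baseline and step count below are assumptions:

```python
# Sketch of integrated gradients over input embeddings, to illustrate why
# gradient-based attribution can be costly (n_steps forward/backward passes).
# Assumes the model/tokenizer setup from the saliency sketch above.
import torch

def integrated_gradients(model, embeddings, attention_mask, target, n_steps=20):
    baseline = torch.zeros_like(embeddings)      # all-zero embedding baseline (assumption)
    total_grads = torch.zeros_like(embeddings)
    for step in range(1, n_steps + 1):
        alpha = step / n_steps
        point = (baseline + alpha * (embeddings - baseline)).detach().requires_grad_(True)
        logits = model(inputs_embeds=point, attention_mask=attention_mask).logits
        logits[0, target].backward()             # one backward pass per step
        total_grads += point.grad
    # Average path gradient, scaled by the input difference; one score per token.
    return ((embeddings - baseline) * total_grads / n_steps).sum(dim=-1)
```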
“…Kobayashi et al. (2020) extended the explainability of the model by also considering the magnitude of the vectors involved in the attention mechanism, and Kobayashi et al. (2021) went as far as incorporating the layer normalization and the skip connection in their analysis. While these works have helped to better understand the local behavior of the Transformer, there is a mismatch between layer-wise attention distributions and global input attributions (Pascual et al., 2021), since intermediate layers only attend to a mix of input tokens. Brunner et al. (2020) quantified the aggregation of contextual information throughout the model with a gradient attribution method.…”
Section: Introduction (mentioning, confidence: 99%)
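The idea of quantifying contextual aggregation with gradients can be sketched by asking how strongly a mid-layer hidden state depends on each input embedding. This is only an illustration of the general idea, not Brunner et al.'s exact attribution method; it assumes the plain BERT encoder and tokenizer from the sketches above.

```python
# Sketch: how much does the hidden state at (layer, position) depend on each input token?
# Score = gradient norm of the hidden state's magnitude w.r.t. each input embedding.
import torch

def context_mixing(model, enc, layer, position):
    embeddings = model.embeddings.word_embeddings(enc["input_ids"])
    embeddings = embeddings.detach().requires_grad_(True)
    out = model(inputs_embeds=embeddings,
                attention_mask=enc["attention_mask"],
                output_hidden_states=True)
    hidden = out.hidden_states[layer][0, position]   # vector at (layer, position)
    hidden.norm().backward()
    # One score per input token: higher means more contextual information flowed in from it.
    return embeddings.grad[0].norm(dim=-1)
```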