Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 2021
DOI: 10.18653/v1/2021.eacl-main.9
Telling BERT’s Full Story: from Local Attention to Global Aggregation

Abstract: We take a deep look into the behaviour of self-attention heads in the transformer architecture. In light of recent work discouraging the use of attention distributions for explaining a model's behaviour, we show that attention distributions can nevertheless provide insights into the local behaviour of attention heads. On this basis, we propose a distinction between local patterns revealed by attention and global patterns that refer back to the input, and analyze BERT from both angles. We use gradient attribution to …

Cited by 14 publications (13 citation statements). References 17 publications.
“…Therefore, one important step is to measure the importance of each token. To this end, we opted for saliency scores, which have recently been shown to be a reliable criterion for measuring tokens' contributions (Bastings and Filippova, 2020; Pascual et al., 2021). In Section 5.1 we will show results for a series of quantitative analyses that support this choice.…”
Section: Gradient-based Saliency Scores (mentioning, confidence: 99%)
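The saliency scores referred to here can be illustrated with a small sketch. Below is a minimal example of one common recipe, gradient-norm saliency over input embeddings, assuming PyTorch and Hugging Face transformers; the checkpoint name, example sentence, and the gradient-norm choice are illustrative assumptions, not details taken from the citing paper.

```python
# Minimal sketch of gradient-based token saliency (assumptions: PyTorch,
# Hugging Face transformers, bert-base-uncased; not the cited paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

enc = tokenizer("a thoroughly enjoyable film", return_tensors="pt")

# Look up the word embeddings ourselves so we can take gradients w.r.t. them.
embeddings = model.bert.embeddings.word_embeddings(enc["input_ids"])
embeddings = embeddings.detach().requires_grad_(True)

# inputs_embeds replaces the word-embedding lookup; position and token-type
# embeddings are still added inside the model.
out = model(inputs_embeds=embeddings, attention_mask=enc["attention_mask"])

# Saliency of a token = L2 norm of the gradient of the top logit w.r.t. its embedding.
out.logits[0].max().backward()
saliency = embeddings.grad[0].norm(dim=-1)

for tok, score in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), saliency):
    print(f"{tok:>12}  {score.item():.4f}")
```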
“…While self-attention is one of the most white-box components in transformer-based models, relying on raw attention weights as an explanation can be misleading, since they are not necessarily responsible for determining the contribution of each token to the final classifier's decision (Jain and Wallace, 2019; Serrano and Smith, 2019; Abnar and Zuidema, 2020). This is because raw attention is faithful only to the local mixture of information in each layer and cannot capture a global perspective of the information flow through the entire model (Pascual et al., 2021).…”
Section: Introduction (mentioning, confidence: 99%)
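The "local mixture of information" this statement refers to can be made concrete by looking at what raw attention actually exposes: a per-layer, per-head distribution over the positions of the previous layer. A minimal sketch, assuming Hugging Face transformers and an arbitrary choice of layer and head:

```python
# Sketch of the "local" view: per-layer, per-head attention distributions from BERT.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

enc = tokenizer("the cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# out.attentions: tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
layer, head = 5, 3                       # arbitrary head, for illustration
attn = out.attentions[layer][0, head]    # (seq_len, seq_len), rows sum to 1

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for i, tok in enumerate(tokens):
    j = attn[i].argmax().item()
    print(f"{tok:>8} attends most to {tokens[j]:>8} ({attn[i, j].item():.2f})")
```

Each row sums to 1 over the previous layer's positions, not over the original input tokens, which is exactly why these distributions describe local mixing rather than global input attributions.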
“…While these methods can be employed for single-layer (local) analysis, multi-layer attributions are not necessarily correlated with single-layer attributions, due to the significant degree of information combination across the layers of multi-layer language models (Pascual et al., 2021; Brunner et al., 2020). Various saliency methods exist for explaining the model's decision based on the input (Li et al., 2016; Bastings and Filippova, 2020; Atanasova et al., 2020; Wu and Ong, 2021).…”
Section: Related Work (mentioning, confidence: 99%)
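One way to see why single-layer and multi-layer views diverge is attention rollout (Abnar and Zuidema, 2020, cited above), which aggregates per-layer attention into an approximate input-level map. The sketch below works under rollout's simplifying assumptions (heads averaged, residual connections treated as identity mixing, MLP effects ignored) and reuses the attentions tuple from the previous sketch:

```python
# Sketch of attention rollout: multiply per-layer attention maps (with residual
# connections approximated as the identity) into one input-level map.
import torch

def attention_rollout(attentions):
    rollout = None
    for layer_attn in attentions:               # (batch, heads, seq, seq)
        a = layer_attn.mean(dim=1)              # average over heads
        a = a + torch.eye(a.size(-1))           # residual connection as identity
        a = a / a.sum(dim=-1, keepdim=True)     # rows sum to 1 again
        rollout = a if rollout is None else a @ rollout
    return rollout                              # (batch, seq, seq)

# rollout[0, i, j]: approximate contribution of input token j to position i at the
# top layer, under rollout's simplifying assumptions.
```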
“…Additionally, gradient-based alternatives (Simonyan et al., 2014; Kindermans et al., 2016; Li et al., 2016) have been argued to provide a more robust basis for token attribution analysis (Atanasova et al., 2020; Brunner et al., 2020; Pascual et al., 2021). Nonetheless, gradient-based alternatives have not fully replaced their attention-based counterparts, mainly because of their high computational cost.…”
(mentioning, confidence: 99%)
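The computational cost mentioned here comes from the extra passes that some gradient methods require. Purely as an illustration (the cited works do not necessarily use this method), integrated gradients needs one forward and backward pass per interpolation step, versus a single forward pass to read off attention weights; the zero-embedding baseline and step count below are assumptions:

```python
# Sketch of integrated gradients over input embeddings, to illustrate why
# gradient-based attribution can be costly (n_steps forward/backward passes).
# Assumes the model/tokenizer setup from the saliency sketch above.
import torch

def integrated_gradients(model, embeddings, attention_mask, target, n_steps=20):
    baseline = torch.zeros_like(embeddings)      # all-zero embedding baseline (assumption)
    total_grads = torch.zeros_like(embeddings)
    for step in range(1, n_steps + 1):
        alpha = step / n_steps
        point = (baseline + alpha * (embeddings - baseline)).detach().requires_grad_(True)
        logits = model(inputs_embeds=point, attention_mask=attention_mask).logits
        logits[0, target].backward()             # one backward pass per step
        total_grads += point.grad
    # Average path gradient, scaled by the input difference; one score per token.
    return ((embeddings - baseline) * total_grads / n_steps).sum(dim=-1)
```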
“…Kobayashi et al. (2020) extended the explainability of the model by also considering the magnitude of the vectors involved in the attention mechanism, and Kobayashi et al. (2021) went as far as incorporating the layer normalization and the skip connection in their analysis. While these works have helped to better understand the local behavior of the Transformer, there is a mismatch between layer-wise attention distributions and global input attributions (Pascual et al., 2021), since intermediate layers only attend to a mix of input tokens. Brunner et al. (2020) quantified the aggregation of contextual information throughout the model with a gradient attribution method.…”
Section: Introduction (mentioning, confidence: 99%)
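The idea of quantifying contextual aggregation with gradients can be sketched by asking how strongly a mid-layer hidden state depends on each input embedding. This is only an illustration of the general idea, not Brunner et al.'s exact attribution method; it assumes the plain BERT encoder and tokenizer from the sketches above.

```python
# Sketch: how much does the hidden state at (layer, position) depend on each input token?
# Score = gradient norm of the hidden state's magnitude w.r.t. each input embedding.
import torch

def context_mixing(model, enc, layer, position):
    embeddings = model.embeddings.word_embeddings(enc["input_ids"])
    embeddings = embeddings.detach().requires_grad_(True)
    out = model(inputs_embeds=embeddings,
                attention_mask=enc["attention_mask"],
                output_hidden_states=True)
    hidden = out.hidden_states[layer][0, position]   # vector at (layer, position)
    hidden.norm().backward()
    # One score per input token: higher means more contextual information flowed in from it.
    return embeddings.grad[0].norm(dim=-1)
```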