Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1424

Visualizing and Understanding the Effectiveness of BERT

Abstract: Language model pre-training, such as BERT, has achieved remarkable results in many NLP tasks. However, it is unclear why the pre-training-then-fine-tuning paradigm can improve performance and generalization capability across different tasks. In this paper, we propose to visualize loss landscapes and optimization trajectories of fine-tuning BERT on specific datasets. First, we find that pre-training reaches a good initial point across downstream tasks, which leads to wider optima and easier optimization compared…
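The paper's central tool is loss-surface visualization. As a rough illustration of the idea, the sketch below evaluates the loss along the straight line between the pre-trained initialization and a fine-tuned solution, a standard 1-D interpolation technique and not necessarily the authors' exact procedure; `eval_loss` and the two parameter dictionaries are hypothetical stand-ins.

    # Minimal sketch: 1-D loss interpolation between a pre-trained initialization
    # and a fine-tuned solution. Assumes `model` matches both state dicts and that
    # `eval_loss(model)` (hypothetical) returns mean loss on a fixed evaluation set.
    import torch

    def interpolate_loss(model, theta_pretrained, theta_finetuned, eval_loss, steps=25):
        """Evaluate loss at theta(a) = (1 - a) * theta_pretrained + a * theta_finetuned."""
        losses = []
        for a in torch.linspace(0.0, 1.0, steps):
            blended = {
                # Blend only floating-point tensors; copy integer buffers as-is.
                name: ((1 - a) * theta_pretrained[name] + a * theta_finetuned[name])
                      if theta_pretrained[name].is_floating_point()
                      else theta_pretrained[name]
                for name in theta_pretrained
            }
            model.load_state_dict(blended)
            with torch.no_grad():
                losses.append(eval_loss(model))
        return losses  # plot against a in [0, 1] to inspect the loss surface

A flat, slowly rising curve between the two endpoints is the kind of "wider optimum" behavior the abstract describes.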

Cited by 123 publications (82 citation statements)
References 19 publications (30 reference statements)
Citation types: 4 supporting, 70 mentioning, 0 contrasting
“…The Transformer allows the attention for a token to be spread over the entire input sequence, multiple times, intuitively capturing different properties. This characteristic has led to a line of research focusing on the interpretation of Transformer-based networks and their attention mechanisms (Raganato and Tiedemann, 2018; Tang et al., 2018; Mareček and Rosa, 2019; Voita et al., 2019a; Vig and Belinkov, 2019; Clark et al., 2019; Kovaleva et al., 2019; Tenney et al., 2019; Lin et al., 2019; Jawahar et al., 2019; van Schijndel et al., 2019; Hao et al., 2019b; Rogers et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%
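For readers who want to inspect these per-layer, per-head attention maps directly, here is a minimal sketch using the HuggingFace transformers library (a common tool in this line of work, though not one prescribed by the excerpt):

    # Extract attention maps from BERT. `output_attentions=True` makes the model
    # return one (batch, heads, seq_len, seq_len) tensor per layer.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

    inputs = tokenizer("Attention spreads over the whole sequence.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # 12 layers for bert-base; each head's row over the sequence sums to 1.
    for layer, att in enumerate(outputs.attentions):
        print(f"layer {layer}: attention shape {tuple(att.shape)}")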
“…In contrast, lower layers are more invariant and show s-class inference results similar to the pretrained model. Hao et al. (2019) and Kovaleva et al. (2019) make similar observations: lower-layer representations are more transferable across different tasks, and top-layer representations are more task-specific after fine-tuning.…”
Section: Probing Results (mentioning)
confidence: 72%
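A minimal sketch of the kind of layer-wise probing behind such observations: train a simple linear classifier on frozen per-layer [CLS] features and compare accuracy across layers. The data variables (train_texts, train_labels, and so on) are hypothetical placeholders, and the linear probe is an illustrative choice rather than any cited paper's exact setup.

    # Layer-wise linear probe over frozen BERT features.
    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

    def layer_features(texts, layer):
        """Return the [CLS] vector from the given layer for each text."""
        feats = []
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True)
            with torch.no_grad():
                hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, 768)
            feats.append(hidden[0, 0].numpy())                 # [CLS] position
        return feats

    probe = LogisticRegression(max_iter=1000)
    # Hypothetical usage with a downstream task's splits; a layer whose probe
    # accuracy tracks the fine-tuned model's is considered more task-specific:
    # probe.fit(layer_features(train_texts, layer=6), train_labels)
    # print(probe.score(layer_features(dev_texts, layer=6), dev_labels))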
“…They especially observe that Transformers' middle layers allow for better transferability. On the other hand, the authors in [5] observe that the early layers of BERT are more invariant across tasks and hence more transferable. It has also been shown in [1] that, after fine-tuning BERT on Question Answering, the model acts in different phases, starting from capturing the semantic meaning of tokens in the first layers to separating the answer token from the others in the last layers.…”
Section: Related Work (mentioning)
confidence: 99%
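One simple way to quantify the layer invariance described above is to compare each layer's representations before and after fine-tuning. The sketch below uses token-wise cosine similarity, an illustrative metric (related work also uses CKA or SVCCA); finetuned_dir is a hypothetical checkpoint path.

    # Compare pre-trained vs. fine-tuned BERT representations per layer.
    import torch
    import torch.nn.functional as F
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    pretrained = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    finetuned = BertModel.from_pretrained("finetuned_dir", output_hidden_states=True)  # hypothetical path

    inputs = tokenizer("Lower layers change less during fine-tuning.", return_tensors="pt")
    with torch.no_grad():
        h_pre = pretrained(**inputs).hidden_states
        h_fin = finetuned(**inputs).hidden_states

    # High similarity in lower layers and lower similarity near the top would
    # match the invariance pattern reported in the excerpts.
    for layer in range(1, len(h_pre)):
        sim = F.cosine_similarity(h_pre[layer], h_fin[layer], dim=-1).mean()
        print(f"layer {layer:2d}: mean token-wise cosine similarity = {sim:.3f}")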
“…This could be explained by the parameter-sharing technique used to train the ALBERT model, which consists of reusing the same parameters across all layers [5].…”
(mentioning)
confidence: 99%
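For illustration, ALBERT-style cross-layer parameter sharing can be sketched in a few lines: a single encoder layer's weights are applied repeatedly, so every "layer" computes the same transformation. This is a toy sketch of the idea, not ALBERT's actual implementation.

    # Toy sketch of cross-layer parameter sharing.
    import torch
    import torch.nn as nn

    class SharedLayerEncoder(nn.Module):
        def __init__(self, d_model=768, nhead=12, num_layers=12):
            super().__init__()
            # One layer instance; its parameters are shared across all depths.
            self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.num_layers = num_layers

        def forward(self, x):
            for _ in range(self.num_layers):
                x = self.shared_layer(x)  # same weights reused at every depth
            return x

    encoder = SharedLayerEncoder()
    out = encoder(torch.randn(2, 16, 768))  # (batch, seq_len, d_model)

Because every depth applies identical parameters, per-layer behavior cannot diverge in the way it does in standard BERT, which is the explanation the quoted statement offers.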