Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1424

Visualizing and Understanding the Effectiveness of BERT

Abstract: Language model pre-training, such as BERT, has achieved remarkable results in many NLP tasks. However, it is unclear why the pre-training-then-fine-tuning paradigm can improve performance and generalization capability across different tasks. In this paper, we propose to visualize loss landscapes and optimization trajectories of fine-tuning BERT on specific datasets. First, we find that pre-training reaches a good initial point across downstream tasks, which leads to wider optima and easier optimization compared…
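The paper's central tool is loss-surface visualization. As a rough illustration of the idea, the sketch below evaluates the loss along the straight line between the pre-trained initialization and a fine-tuned solution, a standard 1-D interpolation technique and not necessarily the authors' exact procedure; `eval_loss` and the two parameter dictionaries are hypothetical stand-ins.

    # Minimal sketch: 1-D loss interpolation between a pre-trained initialization
    # and a fine-tuned solution. Assumes `model` matches both state dicts and that
    # `eval_loss(model)` (hypothetical) returns mean loss on a fixed evaluation set.
    import torch

    def interpolate_loss(model, theta_pretrained, theta_finetuned, eval_loss, steps=25):
        """Evaluate loss at theta(a) = (1 - a) * theta_pretrained + a * theta_finetuned."""
        losses = []
        for a in torch.linspace(0.0, 1.0, steps):
            blended = {
                # Blend only floating-point tensors; copy integer buffers as-is.
                name: ((1 - a) * theta_pretrained[name] + a * theta_finetuned[name])
                      if theta_pretrained[name].is_floating_point()
                      else theta_pretrained[name]
                for name in theta_pretrained
            }
            model.load_state_dict(blended)
            with torch.no_grad():
                losses.append(eval_loss(model))
        return losses  # plot against a in [0, 1] to inspect the loss surface

A flat, slowly rising curve between the two endpoints is the kind of "wider optimum" behavior the abstract describes.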

Cited by 123 publications (82 citation statements)
References 19 publications (30 reference statements)
Citation types: 4 supporting, 70 mentioning, 0 contrasting
“…The Transformer allows the attention for a token to be spread over the entire input sequence, multiple times, intuitively capturing different properties. This characteristic has led to a line of research focusing on the interpretation of Transformer-based networks and their attention mechanisms (Raganato and Tiedemann, 2018; Tang et al., 2018; Mareček and Rosa, 2019; Voita et al., 2019a; Vig and Belinkov, 2019; Clark et al., 2019; Kovaleva et al., 2019; Tenney et al., 2019; Lin et al., 2019; Jawahar et al., 2019; van Schijndel et al., 2019; Hao et al., 2019b; Rogers et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%
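For readers who want to inspect these per-layer, per-head attention maps directly, here is a minimal sketch using the HuggingFace transformers library (a common tool in this line of work, though not one prescribed by the excerpt):

    # Extract attention maps from BERT. `output_attentions=True` makes the model
    # return one (batch, heads, seq_len, seq_len) tensor per layer.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

    inputs = tokenizer("Attention spreads over the whole sequence.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # 12 layers for bert-base; each head's row over the sequence sums to 1.
    for layer, att in enumerate(outputs.attentions):
        print(f"layer {layer}: attention shape {tuple(att.shape)}")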
“…In contrast, lower layers are more invariant and show s-class inference results similar to the pretrained model. Hao et al. (2019) and Kovaleva et al. (2019) make similar observations: lower-layer representations are more transferable across different tasks, and top-layer representations are more task-specific after fine-tuning.…”
Section: Probing Results (mentioning)
confidence: 72%
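A minimal sketch of the kind of layer-wise probing behind such observations: train a simple linear classifier on frozen per-layer [CLS] features and compare accuracy across layers. The data variables (train_texts, train_labels, and so on) are hypothetical placeholders, and the linear probe is an illustrative choice rather than any cited paper's exact setup.

    # Layer-wise linear probe over frozen BERT features.
    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

    def layer_features(texts, layer):
        """Return the [CLS] vector from the given layer for each text."""
        feats = []
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True)
            with torch.no_grad():
                hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, 768)
            feats.append(hidden[0, 0].numpy())                 # [CLS] position
        return feats

    probe = LogisticRegression(max_iter=1000)
    # Hypothetical usage with a downstream task's splits; a layer whose probe
    # accuracy tracks the fine-tuned model's is considered more task-specific:
    # probe.fit(layer_features(train_texts, layer=6), train_labels)
    # print(probe.score(layer_features(dev_texts, layer=6), dev_labels))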
“…They especially observe that Transformers' middle layers allow for better transferability. On the other hand, the authors in [5] observe that the early layers of BERT are more invariant across tasks and hence more transferable. It has also been shown in [1] that, after fine-tuning BERT on Question Answering, the model acts in different phases, starting from capturing the semantic meaning of tokens in the first layers to separating the answer token from the others in the last layers.…”
Section: Related Work (mentioning)
confidence: 99%
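One simple way to quantify the layer invariance described above is to compare each layer's representations before and after fine-tuning. The sketch below uses token-wise cosine similarity, an illustrative metric (related work also uses CKA or SVCCA); finetuned_dir is a hypothetical checkpoint path.

    # Compare pre-trained vs. fine-tuned BERT representations per layer.
    import torch
    import torch.nn.functional as F
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    pretrained = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    finetuned = BertModel.from_pretrained("finetuned_dir", output_hidden_states=True)  # hypothetical path

    inputs = tokenizer("Lower layers change less during fine-tuning.", return_tensors="pt")
    with torch.no_grad():
        h_pre = pretrained(**inputs).hidden_states
        h_fin = finetuned(**inputs).hidden_states

    # High similarity in lower layers and lower similarity near the top would
    # match the invariance pattern reported in the excerpts.
    for layer in range(1, len(h_pre)):
        sim = F.cosine_similarity(h_pre[layer], h_fin[layer], dim=-1).mean()
        print(f"layer {layer:2d}: mean token-wise cosine similarity = {sim:.3f}")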
“…This could be explained by the parameter-sharing technique used to train the ALBERT model, which consists of reusing the same parameters across all layers [5].…”
(mentioning)
confidence: 99%
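For illustration, ALBERT-style cross-layer parameter sharing can be sketched in a few lines: a single encoder layer's weights are applied repeatedly, so every "layer" computes the same transformation. This is a toy sketch of the idea, not ALBERT's actual implementation.

    # Toy sketch of cross-layer parameter sharing.
    import torch
    import torch.nn as nn

    class SharedLayerEncoder(nn.Module):
        def __init__(self, d_model=768, nhead=12, num_layers=12):
            super().__init__()
            # One layer instance; its parameters are shared across all depths.
            self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.num_layers = num_layers

        def forward(self, x):
            for _ in range(self.num_layers):
                x = self.shared_layer(x)  # same weights reused at every depth
            return x

    encoder = SharedLayerEncoder()
    out = encoder(torch.randn(2, 16, 768))  # (batch, seq_len, d_model)

Because every depth applies identical parameters, per-layer behavior cannot diverge in the way it does in standard BERT, which is the explanation the quoted statement offers.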