2019
DOI: 10.48550/arxiv.1908.08593
Preprint

Revealing the Dark Secrets of BERT

Cited by 42 publications (62 citation statements)
References 0 publications
“…We plot the difference between the entropy of the unpruned heads before and after fine-tuning (see Figure 4). We notice that, just like BERT (Kovaleva et al., 2019), the change in entropy is largest for attention heads in the top layers (0.176), compared to the bottom (0.047) or middle (0.042) layers. This suggests that mBERT adjusts the top layers more during fine-tuning.…”
Section: Cross-language Performance
Confidence: 73%
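
The entropy comparison described in the statement above can be reproduced in outline. Below is a minimal sketch, assuming HuggingFace transformers and PyTorch, that computes the mean attention entropy per head for a pre-trained and a fine-tuned model and takes the difference per layer; the fine-tuned checkpoint path is a placeholder, and this is not the cited authors' code.

```python
# Minimal sketch (not the cited authors' code): per-head attention entropy
# before vs. after fine-tuning, using HuggingFace transformers and a
# hypothetical fine-tuned checkpoint at "./bert-finetuned".
import torch
from transformers import AutoTokenizer, AutoModel

def head_entropies(model, enc):
    """Mean attention entropy per (layer, head), averaged over query positions."""
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # out.attentions: one tensor of shape (batch, heads, seq, seq) per layer
    ents = []
    for att in out.attentions:
        ent = -(att * torch.log(att + 1e-12)).sum(-1)  # entropy over key positions
        ents.append(ent.mean(dim=(0, 2)))              # average over batch and queries
    return torch.stack(ents)                           # (layers, heads)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")

pretrained = AutoModel.from_pretrained("bert-base-uncased").eval()
finetuned = AutoModel.from_pretrained("./bert-finetuned").eval()  # hypothetical path

delta = head_entropies(finetuned, enc) - head_entropies(pretrained, enc)
for layer, row in enumerate(delta):
    print(f"layer {layer:2d}: mean |Δentropy| = {row.abs().mean():.3f}")
```

In practice the averages would be taken over a full evaluation set rather than a single sentence; the per-layer means can then be compared across bottom, middle, and top layers as in the citation above.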
“…Clark et al (2019) analyze BERT's attention and observe that the bottom layers attend broadly, while the top layers capture linguistic syntax. Kovaleva et al (2019) find that the last few layers of BERT change the most after task-specific fine-tuning. Similar to our work, Houlsby et al (2019) fine-tune the top layers of BERT, as part of their baseline comparison for their model compression approach.…”
Section: Layerwise Interpretabilitymentioning
confidence: 86%
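
Fine-tuning only the top layers, as in the baseline mentioned above, amounts to freezing the rest of the encoder. A minimal sketch with HuggingFace transformers follows; the choice of two trainable layers and the task head are illustrative assumptions, not details taken from the cited work.

```python
# Illustrative sketch: train only the top N encoder layers (plus the task head)
# of a BERT-base classifier by freezing everything else.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

TOP_N = 2  # number of top encoder layers left trainable (an arbitrary choice)

# Freeze all parameters, then unfreeze the top encoder layers and the classifier.
for param in model.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[-TOP_N:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```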
“…Michel et al (2019), for example, note that only a few attention heads need to be retained in each layer for acceptable effectiveness. Kovaleva et al (2019) find that, on many tasks, just the last few layers change the most after the fine-tuning process. We take these observations as evidence that only the last few layers necessarily need to be fine-tuned.…”
Section: Introductionmentioning
confidence: 95%
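
Retaining only a subset of attention heads per layer, as Michel et al. (2019) suggest, can be approximated with the prune_heads utility in HuggingFace transformers. The sketch below is illustrative only; the pruned head indices are arbitrary, not the heads identified in that work.

```python
# Sketch only: removing attention heads via transformers' prune_heads utility.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Map layer index -> list of head indices to remove in that layer (arbitrary picks).
heads_to_prune = {0: [0, 2], 11: [1, 5, 7]}

before = sum(p.numel() for p in model.parameters())
model.prune_heads(heads_to_prune)
after = sum(p.numel() for p in model.parameters())

print(model.config.pruned_heads)          # records which heads were removed
print(f"parameters: {before:,} -> {after:,}")
```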
“…The visualizations show that BERT reaches a good initial point during pre-training for downstream tasks, which can lead to better optima compared to randomly-initialized models. Kovaleva et al. (2019) visualize the attention heads of BERT, discovering a limited set of attention patterns across different heads. This suggests that BERT's attention heads are highly redundant.…”
Section: Analyzing Contextual Embeddings
Confidence: 99%
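
One rough way to probe the redundancy described above is to extract the attention maps with output_attentions=True and compare heads within each layer. The sketch below uses cosine similarity between flattened attention maps as a crude proxy; it is an illustration under that assumption, not the analysis from the cited paper.

```python
# Rough sketch: quantify head-to-head similarity of BERT attention maps per layer.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

enc = tok("Attention heads often repeat a small set of patterns.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc, output_attentions=True).attentions  # one tensor per layer

for layer_idx, att in enumerate(attentions):
    # att: (batch=1, heads, seq, seq) -> flatten each head's attention map to a vector
    heads = att[0].flatten(start_dim=1)                   # (heads, seq*seq)
    heads = torch.nn.functional.normalize(heads, dim=1)
    sim = heads @ heads.T                                 # pairwise cosine similarity
    off_diag = sim[~torch.eye(sim.size(0), dtype=torch.bool)]
    print(f"layer {layer_idx:2d}: mean head-to-head similarity = {off_diag.mean():.2f}")
```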