2019
DOI: 10.48550/arxiv.1908.08593
Preprint

Revealing the Dark Secrets of BERT

Cited by 42 publications (62 citation statements)
References 0 publications
“…We plot the difference between the entropy of the unpruned heads before and after fine-tuning (see Figure 4). We notice that, just like BERT (Kovaleva et al., 2019), the change in entropy is largest for attention heads in the top layers (0.176), compared to the bottom (0.047) or middle (0.042) layers. This suggests that mBERT adjusts the top layers more during fine-tuning.…”
Section: Cross-language Performance
Confidence: 73%
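
The entropy comparison described in the statement above can be reproduced in outline. Below is a minimal sketch, assuming HuggingFace transformers and PyTorch, that computes the mean attention entropy per head for a pre-trained and a fine-tuned model and takes the difference per layer; the fine-tuned checkpoint path is a placeholder, and this is not the cited authors' code.

```python
# Minimal sketch (not the cited authors' code): per-head attention entropy
# before vs. after fine-tuning, using HuggingFace transformers and a
# hypothetical fine-tuned checkpoint at "./bert-finetuned".
import torch
from transformers import AutoTokenizer, AutoModel

def head_entropies(model, enc):
    """Mean attention entropy per (layer, head), averaged over query positions."""
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # out.attentions: one tensor of shape (batch, heads, seq, seq) per layer
    ents = []
    for att in out.attentions:
        ent = -(att * torch.log(att + 1e-12)).sum(-1)  # entropy over key positions
        ents.append(ent.mean(dim=(0, 2)))              # average over batch and queries
    return torch.stack(ents)                           # (layers, heads)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")

pretrained = AutoModel.from_pretrained("bert-base-uncased").eval()
finetuned = AutoModel.from_pretrained("./bert-finetuned").eval()  # hypothetical path

delta = head_entropies(finetuned, enc) - head_entropies(pretrained, enc)
for layer, row in enumerate(delta):
    print(f"layer {layer:2d}: mean |Δentropy| = {row.abs().mean():.3f}")
```

In practice the averages would be taken over a full evaluation set rather than a single sentence; the per-layer means can then be compared across bottom, middle, and top layers as in the citation above.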
“…Clark et al (2019) analyze BERT's attention and observe that the bottom layers attend broadly, while the top layers capture linguistic syntax. Kovaleva et al (2019) find that the last few layers of BERT change the most after task-specific fine-tuning. Similar to our work, Houlsby et al (2019) fine-tune the top layers of BERT, as part of their baseline comparison for their model compression approach.…”
Section: Layerwise Interpretabilitymentioning
confidence: 86%
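
Fine-tuning only the top layers, as in the baseline mentioned above, amounts to freezing the rest of the encoder. A minimal sketch with HuggingFace transformers follows; the choice of two trainable layers and the task head are illustrative assumptions, not details taken from the cited work.

```python
# Illustrative sketch: train only the top N encoder layers (plus the task head)
# of a BERT-base classifier by freezing everything else.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

TOP_N = 2  # number of top encoder layers left trainable (an arbitrary choice)

# Freeze all parameters, then unfreeze the top encoder layers and the classifier.
for param in model.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[-TOP_N:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```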
“…Michel et al (2019), for example, note that only a few attention heads need to be retained in each layer for acceptable effectiveness. Kovaleva et al (2019) find that, on many tasks, just the last few layers change the most after the fine-tuning process. We take these observations as evidence that only the last few layers necessarily need to be fine-tuned.…”
Section: Introductionmentioning
confidence: 95%
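
Retaining only a subset of attention heads per layer, as Michel et al. (2019) suggest, can be approximated with the prune_heads utility in HuggingFace transformers. The sketch below is illustrative only; the pruned head indices are arbitrary, not the heads identified in that work.

```python
# Sketch only: removing attention heads via transformers' prune_heads utility.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Map layer index -> list of head indices to remove in that layer (arbitrary picks).
heads_to_prune = {0: [0, 2], 11: [1, 5, 7]}

before = sum(p.numel() for p in model.parameters())
model.prune_heads(heads_to_prune)
after = sum(p.numel() for p in model.parameters())

print(model.config.pruned_heads)          # records which heads were removed
print(f"parameters: {before:,} -> {after:,}")
```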
“…The visualizations show that BERT reaches a good initial point during pre-training for downstream tasks, which can lead to better optima compared to randomly-initialized models. Kovaleva et al. (2019) visualize the attention heads of BERT, discovering a limited set of attention patterns across different heads. This suggests that BERT's attention heads are highly redundant.…”
Section: Analyzing Contextual Embeddings
Confidence: 99%
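
One rough way to probe the redundancy described above is to extract the attention maps with output_attentions=True and compare heads within each layer. The sketch below uses cosine similarity between flattened attention maps as a crude proxy; it is an illustration under that assumption, not the analysis from the cited paper.

```python
# Rough sketch: quantify head-to-head similarity of BERT attention maps per layer.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

enc = tok("Attention heads often repeat a small set of patterns.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc, output_attentions=True).attentions  # one tensor per layer

for layer_idx, att in enumerate(attentions):
    # att: (batch=1, heads, seq, seq) -> flatten each head's attention map to a vector
    heads = att[0].flatten(start_dim=1)                   # (heads, seq*seq)
    heads = torch.nn.functional.normalize(heads, dim=1)
    sim = heads @ heads.T                                 # pairwise cosine similarity
    off_diag = sim[~torch.eye(sim.size(0), dtype=torch.bool)]
    print(f"layer {layer_idx:2d}: mean head-to-head similarity = {off_diag.mean():.2f}")
```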