Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.133
Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent

Abstract: The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude (ℓ2 norm) during training, and its implications for the emergent representations within self-attention layers. Empirically, we document norm growth in the training of transform…
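
The central quantity in the abstract, the ℓ2 norm of the full parameter vector, is easy to track during training. Below is a minimal sketch (not from the authors' codebase) of how one might log it; it assumes a PyTorch model, and the training loop shown in the comments is a placeholder.

import math
import torch

def global_l2_norm(model: torch.nn.Module) -> float:
    # Treat all parameters as one flat vector and return its l2 norm.
    squared_sum = 0.0
    for p in model.parameters():
        squared_sum += p.detach().float().pow(2).sum().item()
    return math.sqrt(squared_sum)

# Hypothetical usage inside a standard training loop; `model`, `optimizer`,
# `loss_fn`, and `loader` are assumed to be defined elsewhere.
# for step, (inputs, targets) in enumerate(loader):
#     optimizer.zero_grad()
#     loss = loss_fn(model(inputs), targets)
#     loss.backward()
#     optimizer.step()
#     if step % 100 == 0:
#         print(step, global_l2_norm(model))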

Cited by 10 publications (6 citation statements). References 22 publications.
“…The Transformer architecture (Vaswani et al., 2017) became the backbone of the state-of-the-art models in a variety of tasks (Raffel et al., 2019; Adiwardana et al., 2020; Brown et al., 2020). This spurred a significant interest in better understanding the inner workings of these models (Vig and Belinkov, 2019; Clark et al., 2019; Kharitonov and Chaabouni, 2020; Hahn, 2020; Movva and Zhao, 2020; Chaabouni et al., 2021; Merrill et al., 2021; Sinha et al., 2021). Most of these works have focussed specifically on how models generalize and capture structure across samples that are similar.…”
Section: Introduction
confidence: 99%
“…In §4.1, we claimed that the scaling effect of layer normalization has no effect on the decisions of our constructions for PARITY and FIRST. This is related to the property of approximate homogeneity studied by Merrill et al. (2021).…”
Section: A Correctness of Parity Construction
confidence: 96%
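
As a brief illustrative aside (not the exact formalization used in either paper): the normalization step of layer norm is invariant to positive rescaling of its input, so uniformly blowing up the pre-normalization activations, for instance through growing parameter norms, leaves the normalized output, and hence downstream decisions, unchanged; homogeneity is one way to describe how a function behaves under such rescaling. In LaTeX, ignoring the learned affine transform that follows normalization:

\mathrm{LN}(c\,x) = \frac{c\,x - \operatorname{mean}(c\,x)}{\operatorname{std}(c\,x)}
                 = \frac{c\,\bigl(x - \operatorname{mean}(x)\bigr)}{c\,\operatorname{std}(x)}
                 = \mathrm{LN}(x), \qquad c > 0;
\qquad f \text{ is } k\text{-homogeneous iff } f(c\,x) = c^{k} f(x) \ \text{for all } c > 0.
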
“…We also tested Euclidean distance, but it did not produce clear clusters. This is likely caused by the growth of the norm of the weight vector during training (Merrill et al., 2020), which is unrelated to the data at hand (§C). This may also explain questions that were previously left open (Qin et al., 2022).…”
Section: Clustering
confidence: 99%
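
The observation in the last excerpt, that a magnitude-sensitive distance is dominated by norm growth carrying no information about the data, can be seen with a toy comparison. Below is a minimal sketch (not from either paper) contrasting Euclidean distance with cosine similarity under uniform rescaling of two fixed vectors:

import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=64)
v = rng.normal(size=64)

def cosine(a, b):
    # Cosine similarity is invariant to positive rescaling of its inputs.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for scale in (1.0, 10.0, 100.0):  # mimic weight vectors whose norm has grown
    a, b = scale * u, scale * v
    print(scale, round(float(np.linalg.norm(a - b)), 3), round(cosine(a, b), 6))

# The Euclidean distance grows linearly with the common scale factor while the
# cosine similarity stays fixed, so norm growth alone can wash out clusters
# found with Euclidean distance but leaves angle-based measures untouched.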