2022
DOI: 10.48550/arxiv.2205.01445
Preprint

High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation

Abstract: We study the first gradient descent step on the first-layer parameters $W$ in a two-layer neural network $f(x) = \frac{1}{\sqrt{N}}\, a^\top \sigma(W^\top x)$, where $W \in \mathbb{R}^{d \times N}$ and $a \in \mathbb{R}^{N}$ are randomly initialized, and the training objective is the empirical MSE loss $\frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - y_i\right)^2$. In the proportional asymptotic limit where $n, d, N \to \infty$ at the same rate, and an idealized student-teacher setting, we show that the first gradient update contains a rank-1 "spike", which results in an alignment between the first-layer weights and the linear component of the t…
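To make the setting concrete, the sketch below simulates one such first gradient step and inspects the spectrum of the update. It is an illustration only, not the paper's experiment: the single-index tanh teacher with direction beta, the tanh activation, the symmetric +/- initialization (which makes the network output exactly zero at initialization, so the residual is simply -y), and the particular sizes n, d, N are all assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Proportional regime: n, d, N of comparable size (finite values for illustration).
n, d, N = 3000, 1000, 1500

# Single-index teacher y = tanh(<x, beta>); beta plays the role of the
# teacher's linear component direction (illustrative choice).
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)
X = rng.standard_normal((n, d))
y = np.tanh(X @ beta)

# Two-layer student f(x) = a^T sigma(W^T x) / sqrt(N), randomly initialized.
# Symmetric paired +/- initialization zeroes the output at init (a simplification
# made here so that the residual is exactly -y).
W_half = rng.standard_normal((d, N // 2)) / np.sqrt(d)
W = np.hstack([W_half, W_half])
a = np.concatenate([np.ones(N // 2), -np.ones(N // 2)])

sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

Z = X @ W                              # (n, N) pre-activations
f0 = sigma(Z) @ a / np.sqrt(N)         # network output at init (identically zero here)
r = f0 - y                             # residuals

# Gradient of the empirical MSE (1/n) sum_i (f(x_i) - y_i)^2 with respect to W:
# dL/dW[j, k] = (2 / (n sqrt(N))) * sum_i r_i * a_k * sigma'(Z[i, k]) * X[i, j]
G = (2.0 / (n * np.sqrt(N))) * (X.T @ ((r[:, None] * dsigma(Z)) * a[None, :]))

# Claim being illustrated: G contains a rank-1 "spike" whose leading left
# singular vector has non-trivial overlap with the teacher direction beta.
U, s, _ = np.linalg.svd(G, full_matrices=False)
print("top singular values:", np.round(s[:4], 4))
print("spike ratio s1/s2  :", round(s[0] / s[1], 2))
print("overlap |<u1,beta>|:", round(abs(U[:, 0] @ beta), 3))
```

If the abstract's claim carries over to this toy setting, the printed spike ratio should be large and the overlap with beta should be clearly bounded away from zero; both choices of scale here are assumptions rather than the paper's exact experimental protocol.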

Cited by 2 publications (3 citation statements)
References 30 publications (45 reference statements)

“…$\partial_t U^{\alpha\alpha}_t = b\,U^{\alpha\alpha}_t\,(U^{\alpha\alpha}_t - 1)$ (D.22), where we observe that if $b > 0$ this ODE is "mean avoiding", as it will drift towards $0$ or $\infty$. And since the $V_t$ time scale is on the order of $1/n$, for all $t > 0$ we have that…”
Section: D.3 Proof of Proposition 3.7 (Finite Time Explosion Criterion)
Citation type: mentioning, confidence: 86%
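For intuition (a standard separation-of-variables computation, not taken from the citing paper), the ODE in the excerpt solves in closed form; writing $U_t$ for $U^{\alpha\alpha}_t$:
$$
\partial_t U_t = b\,U_t\,(U_t - 1)
\quad\Longrightarrow\quad
U_t = \frac{U_0}{U_0 - (U_0 - 1)\,e^{bt}} .
$$
For $b > 0$, if $U_0 \in (0,1)$ the denominator grows and $U_t \to 0$, while if $U_0 > 1$ the denominator vanishes at $t^{*} = \tfrac{1}{b}\log\!\big(\tfrac{U_0}{U_0 - 1}\big)$ and $U_t$ explodes in finite time, consistent with the "mean avoiding" behaviour and the finite-time explosion criterion named in the section.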
“…The Neural Tangent Kernel (NTK) limit formed the foundation for a rush of theoretical work, including advances in our understanding of generalization for wide networks [13][14][15]. Besides the NTK limit, the infinite-width mean-field limit was developed [16][17][18][19], where the different parameterization demonstrates benefits for feature learning and hyperparameter tuning [20][21][22].…”
Section: Introduction
Citation type: mentioning, confidence: 99%
“…Learning representations. An existing line of work (Yehudai and Shamir, 2019; Allen-Zhu et al., 2019; Abbe et al., 2021; Damian et al., 2022; Ba et al., 2022) studies in depth the representations learned by neural networks trained with (S)GD at finite width, from a different perspective that focuses on the performance advantages of feature learning over random features. In contrast, our aim is to describe the representations themselves in relation to the symmetries of the problem.…”
Section: Related Work
Citation type: mentioning, confidence: 99%