2022
DOI: 10.48550/arxiv.2205.01445
Preprint

High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation

Abstract: We study the first gradient descent step on the first-layer parameters $W$ in a two-layer neural network $f(x) = \frac{1}{\sqrt{N}}\, a^\top \sigma(W^\top x)$, where $W \in \mathbb{R}^{d \times N}$ and $a \in \mathbb{R}^{N}$ are randomly initialized, and the training objective is the empirical MSE loss $\frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - y_i\right)^2$. In the proportional asymptotic limit where $n, d, N \to \infty$ at the same rate, and an idealized student-teacher setting, we show that the first gradient update contains a rank-1 "spike", which results in an alignment between the first-layer weights and the linear component of the t…
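To make the setting concrete, the sketch below simulates one such first gradient step and inspects the spectrum of the update. It is an illustration only, not the paper's experiment: the single-index tanh teacher with direction beta, the tanh activation, the symmetric +/- initialization (which makes the network output exactly zero at initialization, so the residual is simply -y), and the particular sizes n, d, N are all assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Proportional regime: n, d, N of comparable size (finite values for illustration).
n, d, N = 3000, 1000, 1500

# Single-index teacher y = tanh(<x, beta>); beta plays the role of the
# teacher's linear component direction (illustrative choice).
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)
X = rng.standard_normal((n, d))
y = np.tanh(X @ beta)

# Two-layer student f(x) = a^T sigma(W^T x) / sqrt(N), randomly initialized.
# Symmetric paired +/- initialization zeroes the output at init (a simplification
# made here so that the residual is exactly -y).
W_half = rng.standard_normal((d, N // 2)) / np.sqrt(d)
W = np.hstack([W_half, W_half])
a = np.concatenate([np.ones(N // 2), -np.ones(N // 2)])

sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

Z = X @ W                              # (n, N) pre-activations
f0 = sigma(Z) @ a / np.sqrt(N)         # network output at init (identically zero here)
r = f0 - y                             # residuals

# Gradient of the empirical MSE (1/n) sum_i (f(x_i) - y_i)^2 with respect to W:
# dL/dW[j, k] = (2 / (n sqrt(N))) * sum_i r_i * a_k * sigma'(Z[i, k]) * X[i, j]
G = (2.0 / (n * np.sqrt(N))) * (X.T @ ((r[:, None] * dsigma(Z)) * a[None, :]))

# Claim being illustrated: G contains a rank-1 "spike" whose leading left
# singular vector has non-trivial overlap with the teacher direction beta.
U, s, _ = np.linalg.svd(G, full_matrices=False)
print("top singular values:", np.round(s[:4], 4))
print("spike ratio s1/s2  :", round(s[0] / s[1], 2))
print("overlap |<u1,beta>|:", round(abs(U[:, 0] @ beta), 3))
```

If the abstract's claim carries over to this toy setting, the printed spike ratio should be large and the overlap with beta should be clearly bounded away from zero; both choices of scale here are assumptions rather than the paper's exact experimental protocol.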

Cited by 2 publications (3 citation statements)
References 30 publications (45 reference statements)

“…$\partial_t U^{\alpha\alpha}_t = b\,U^{\alpha\alpha}_t\,(U^{\alpha\alpha}_t - 1)$ (D.22), where we observe that if $b > 0$ this ODE is "mean avoiding", as it will drift towards $0$ or $\infty$. And since the $V_t$ time scale is on the order of $1/n$, for all $t > 0$ we have that…”
Section: D.3 Proof of Proposition 3.7 (Finite Time Explosion Criterion)
Citation type: mentioning, confidence: 86%
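For intuition (a standard separation-of-variables computation, not taken from the citing paper), the ODE in the excerpt solves in closed form; writing $U_t$ for $U^{\alpha\alpha}_t$:
$$
\partial_t U_t = b\,U_t\,(U_t - 1)
\quad\Longrightarrow\quad
U_t = \frac{U_0}{U_0 - (U_0 - 1)\,e^{bt}} .
$$
For $b > 0$, if $U_0 \in (0,1)$ the denominator grows and $U_t \to 0$, while if $U_0 > 1$ the denominator vanishes at $t^{*} = \tfrac{1}{b}\log\!\big(\tfrac{U_0}{U_0 - 1}\big)$ and $U_t$ explodes in finite time, consistent with the "mean avoiding" behaviour and the finite-time explosion criterion named in the section.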
“…The Neural Tangent Kernel (NTK) limit formed the foundation for a rush of theoretical work, including advances in our understanding of generalization for wide networks [13][14][15]. Besides the NTK limit, the infinite-width mean-field limit was developed [16][17][18][19], where the different parameterization demonstrates benefits for feature learning and hyperparameter tuning [20][21][22].…”
Section: Introduction
Citation type: mentioning, confidence: 99%
“…Learning representations. An existing line of work (Yehudai and Shamir, 2019; Allen-Zhu et al., 2019; Abbe et al., 2021; Damian et al., 2022; Ba et al., 2022) studies in depth the representations learned by neural networks trained with (S)GD at finite width, from a different perspective that focuses on the performance advantages of feature learning over random features. In contrast, our aim is to describe the representations themselves in relation to the symmetries of the problem.…”
Section: Related Work
Citation type: mentioning, confidence: 99%