2019 · Preprint
DOI: 10.48550/arxiv.1902.04760

Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

Abstract: Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Process to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straightline tensor program that can express most neural network computations, and we characterize its scaling limit …
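
The abstract's central object is the "straightline tensor program." As a rough illustration only (not the paper's formal definition; the function name and the tiny example program below are invented for this sketch), such a program is a fixed sequence of lines whose variables are n-dimensional vectors, each produced either by multiplying an earlier vector with a random Gaussian matrix or by applying a coordinatewise nonlinearity to earlier vectors; the paper's scaling limits describe how the coordinates of these vectors behave as n grows.

import numpy as np

def run_tensor_program(n, rng):
    """Illustrative straight-line 'tensor program' sketch (not the paper's formal API).

    Every variable is an n-dimensional vector; each line is either
      - a MatMul line: multiply an earlier vector by an i.i.d. Gaussian matrix, or
      - a Nonlin line: apply a coordinatewise function to earlier vectors.
    """
    x = rng.normal(size=n)                       # input vector
    W1 = rng.normal(size=(n, n)) / np.sqrt(n)    # random Gaussian matrix (1/sqrt(n) scaling)
    W2 = rng.normal(size=(n, n)) / np.sqrt(n)

    g1 = W1 @ x               # MatMul line
    h1 = np.tanh(g1)          # Nonlin line (coordinatewise)
    g2 = W2 @ h1              # MatMul line
    h2 = np.tanh(g2) * h1     # Nonlin line combining two earlier vectors
    return h2

rng = np.random.default_rng(0)
out = run_tensor_program(n=4096, rng=rng)
print(out.mean(), out.std())  # as n grows, coordinates behave like i.i.d. draws from a limiting distribution

Forward passes, backpropagation, and weight-tied architectures can all be written as such straight-line sequences, which is how a single scaling-limit characterization can address the Gaussian process behavior, gradient independence, and NTK results named in the title.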

Cited by 111 publications (202 citation statements)
References 56 publications

Citation statements (ordered by relevance):

“…Several existing works (Vaskevicius et al., 2019; Woodworth et al., 2020; Zhao et al., 2019) have shown that for the quadratically overparametrized linear model, i.e., w = u^2 − v^2 or w = u ⊙ v, gradient descent/flow from small initialization implicitly regularizes the ℓ1 norm and provides better generalization when the ground truth is sparse. This is in sharp contrast to the kernel regime, where neural networks trained by gradient descent behave like kernel methods (Daniely, 2017; Jacot et al., 2018; Yang, 2019). This allows one to prove convergence to zero-loss solutions in overparametrized settings (Du et al., 2018; Allen-Zhu et al., 2019a;b; Du et al., 2019), where the learnt function minimizes the corresponding RKHS norm (Arora et al., 2019b; Chizat et al., 2018).…”
Section: Related Work (mentioning)
Confidence: 99%
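
To make the quoted claim concrete, here is a minimal numerical sketch (not code from any of the cited works; the problem sizes, learning rate, and small initialization scale alpha are arbitrary illustrative choices) of gradient descent on an underdetermined sparse regression problem with the quadratic reparametrization w = u^2 − v^2. Starting from small initialization, the recovered w tends to land near the sparse ground truth, i.e. near a small-ℓ1 solution, rather than the minimum-ℓ2-norm interpolant.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 200, 3                        # few measurements, high dimension, sparse ground truth
X = rng.normal(size=(n, d)) / np.sqrt(n)
w_star = np.zeros(d)
w_star[:k] = 1.0
y = X @ w_star

alpha, lr, steps = 1e-3, 0.1, 50_000        # small initialization scale, illustrative hyperparameters
u = alpha * np.ones(d)
v = alpha * np.ones(d)
for _ in range(steps):
    r = X @ (u**2 - v**2) - y               # residual of the reparametrized model
    g = X.T @ r                             # gradient w.r.t. w = u^2 - v^2
    u -= lr * 2 * u * g                     # chain rule through u^2
    v += lr * 2 * v * g                     # chain rule through -v^2

w_gd = u**2 - v**2
w_l2 = np.linalg.lstsq(X, y, rcond=None)[0] # minimum-l2-norm interpolant, for comparison
print("recovery error:", np.linalg.norm(w_gd - w_star))
print("l1 norms (GD vs min-l2):", np.abs(w_gd).sum(), np.abs(w_l2).sum())

Typically the gradient-descent solution has ℓ1 norm close to that of the sparse ground truth, while the minimum-ℓ2-norm interpolant spreads mass over many coordinates.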
“…In the literature on signal propagation at initialization (e.g. Poole et al., 2016; Hayou et al., 2019; Yang and Schoenholz, 2017; Yang, 2019a; Xiao et al., 2018), results on gradient backpropagation rely on the assumption that the weights used for backpropagation are independent of the ones used for forward propagation. Yang (2020) showed that this assumption yields exact computations of the gradient covariance and NTK in the infinite-width limit.…”
Section: The Equilibrium Hypothesis (mentioning)
Confidence: 99%
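
For context, the independence assumption the quote refers to is what makes the backward second moments computable layer by layer. In the conventions of the signal-propagation literature (the symbols σ_w, σ_b, φ and the exact normalization are illustrative and vary slightly across the cited papers), the forward and backward recursions read roughly

\[
q^{l+1} = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal N(0,\,q^{l})}\!\left[\phi(z)^2\right] + \sigma_b^2,
\qquad
\tilde q^{\,l} = \chi^{l}\,\tilde q^{\,l+1},
\qquad
\chi^{l} = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal N(0,\,q^{l})}\!\left[\phi'(z)^2\right],
\]

where q^l is the variance of a pre-activation at layer l and \tilde q^l the variance of the gradient with respect to it. The backward recursion is precisely the step that uses the forward–backward weight independence, which the quote notes Yang (2020) justified exactly in the infinite-width limit.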
“…Gradient covariance back-propagation. Analytical formulas for gradient covariance back-propagation were derived using this result in (Hayou et al., 2019; Poole et al., 2016; Xiao et al., 2018; Yang, 2019a). Empirical results showed an excellent match for FFNNs in , for ResNets in Yang (2019a), and for CNNs in Xiao et al. (2018).…”
Section: Conclusion and Limitations (mentioning)
Confidence: 99%
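
As a quick numerical illustration of this kind of match (a hedged sketch, not a reproduction of any experiment in the cited papers; the width, depth, σ_w = 1.5, and the tanh nonlinearity are arbitrary choices), one can backpropagate a random gradient through a wide random tanh network using the actual tied weights and compare the per-layer gradient second moment with the recursion sketched above:

import numpy as np

rng = np.random.default_rng(1)
n, L, sw = 2000, 5, 1.5                     # width, depth, sigma_w (illustrative)
phi = np.tanh
dphi = lambda z: 1.0 / np.cosh(z) ** 2      # derivative of tanh

# Forward pass through a random tanh MLP (no biases), storing pre-activations.
h = rng.normal(size=n)
Ws, zs = [], []
for _ in range(L):
    W = rng.normal(size=(n, n)) * sw / np.sqrt(n)
    z = W @ h
    Ws.append(W)
    zs.append(z)
    h = phi(z)

# Backward pass reusing the *same* weights, starting from a random output gradient.
delta = rng.normal(size=n)                  # gradient w.r.t. the last pre-activation
emp = [np.mean(delta ** 2)]
theory = [np.mean(delta ** 2)]
for l in range(L - 1, 0, -1):
    delta = dphi(zs[l - 1]) * (Ws[l].T @ delta)       # actual backprop step with tied weights
    emp.append(np.mean(delta ** 2))
    chi = sw ** 2 * np.mean(dphi(zs[l - 1]) ** 2)     # analytic multiplier chi, estimated from forward pre-activations
    theory.append(theory[-1] * chi)

print(np.c_[emp, theory])  # the two columns should agree up to finite-width fluctuations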
“…We follow the correspondence between NNs and QFT pioneered by Halverson, Maiti and Stoner [1]. Its originality with respect to other approaches lies in the observation that, under very general conditions, NNs with infinitely wide layers are described by a Gaussian process (GP) due to the central limit theorem [38][39][40][41][42][43][44][45][46]. Realistic architectures never involve an infinite number of hyperparameters N, and their behavior fails to be well described by a GP.…”
Section: Introduction and Outline (mentioning)
Confidence: 99%
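
The GP-via-CLT observation in this quote is easy to check numerically. The following sketch (illustrative input dimension, widths, and trial count only; it is not taken from any of the cited references) samples the scalar output of many random one-hidden-layer ReLU networks at a single fixed input and verifies that the excess kurtosis of the output distribution shrinks toward the Gaussian value 0 as the width grows:

import numpy as np

rng = np.random.default_rng(0)
d = 10
x = rng.normal(size=d)                       # one fixed input

def sample_outputs(width, trials=4000):
    """Scalar outputs of `trials` independent random 1-hidden-layer ReLU nets at input x."""
    W = rng.normal(size=(trials, width, d)) / np.sqrt(d)      # hidden weights
    a = rng.normal(size=(trials, width)) / np.sqrt(width)     # readout weights
    h = np.maximum(W @ x, 0.0)                                # hidden activations, shape (trials, width)
    return (a * h).sum(axis=1)                                # one network output per trial

for width in (4, 32, 256):
    f = sample_outputs(width)
    excess_kurt = ((f - f.mean()) ** 4).mean() / f.var() ** 2 - 3.0
    print(width, round(excess_kurt, 3))       # should drift toward 0 (Gaussian) as width grows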