2019 · Preprint
DOI: 10.48550/arxiv.1902.04760

Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

Abstract: Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Process to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straightline tensor program that can express most neural network computations, and we characterize its scaling limit …
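
The abstract's central object is the "straightline tensor program." As a rough illustration only (not the paper's formal definition; the function name and the tiny example program below are invented for this sketch), such a program is a fixed sequence of lines whose variables are n-dimensional vectors, each produced either by multiplying an earlier vector with a random Gaussian matrix or by applying a coordinatewise nonlinearity to earlier vectors; the paper's scaling limits describe how the coordinates of these vectors behave as n grows.

import numpy as np

def run_tensor_program(n, rng):
    """Illustrative straight-line 'tensor program' sketch (not the paper's formal API).

    Every variable is an n-dimensional vector; each line is either
      - a MatMul line: multiply an earlier vector by an i.i.d. Gaussian matrix, or
      - a Nonlin line: apply a coordinatewise function to earlier vectors.
    """
    x = rng.normal(size=n)                       # input vector
    W1 = rng.normal(size=(n, n)) / np.sqrt(n)    # random Gaussian matrix (1/sqrt(n) scaling)
    W2 = rng.normal(size=(n, n)) / np.sqrt(n)

    g1 = W1 @ x               # MatMul line
    h1 = np.tanh(g1)          # Nonlin line (coordinatewise)
    g2 = W2 @ h1              # MatMul line
    h2 = np.tanh(g2) * h1     # Nonlin line combining two earlier vectors
    return h2

rng = np.random.default_rng(0)
out = run_tensor_program(n=4096, rng=rng)
print(out.mean(), out.std())  # as n grows, coordinates behave like i.i.d. draws from a limiting distribution

Forward passes, backpropagation, and weight-tied architectures can all be written as such straight-line sequences, which is how a single scaling-limit characterization can address the Gaussian process behavior, gradient independence, and NTK results named in the title.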

Cited by 111 publications (202 citation statements)
References 56 publications

Citation statements (ordered by relevance):

“…Several existing works (Vaskevicius et al., 2019; Woodworth et al., 2020; Zhao et al., 2019) have shown that for the quadratically overparametrized linear model, i.e., w = u^2 − v^2 or w = u ⊙ v, gradient descent/flow from small initialization implicitly regularizes the ℓ1 norm and provides better generalization when the ground truth is sparse. This is in sharp contrast to the kernel regime, where neural networks trained by gradient descent behave like kernel methods (Daniely, 2017; Jacot et al., 2018; Yang, 2019). This allows one to prove convergence to zero-loss solutions in overparametrized settings (Du et al., 2018; Allen-Zhu et al., 2019a;b; Du et al., 2019), where the learnt function minimizes the corresponding RKHS norm (Arora et al., 2019b; Chizat et al., 2018).…”
Section: Related Work (mentioning)
Confidence: 99%
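
To make the quoted claim concrete, here is a minimal numerical sketch (not code from any of the cited works; the problem sizes, learning rate, and small initialization scale alpha are arbitrary illustrative choices) of gradient descent on an underdetermined sparse regression problem with the quadratic reparametrization w = u^2 − v^2. Starting from small initialization, the recovered w tends to land near the sparse ground truth, i.e. near a small-ℓ1 solution, rather than the minimum-ℓ2-norm interpolant.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 200, 3                        # few measurements, high dimension, sparse ground truth
X = rng.normal(size=(n, d)) / np.sqrt(n)
w_star = np.zeros(d)
w_star[:k] = 1.0
y = X @ w_star

alpha, lr, steps = 1e-3, 0.1, 50_000        # small initialization scale, illustrative hyperparameters
u = alpha * np.ones(d)
v = alpha * np.ones(d)
for _ in range(steps):
    r = X @ (u**2 - v**2) - y               # residual of the reparametrized model
    g = X.T @ r                             # gradient w.r.t. w = u^2 - v^2
    u -= lr * 2 * u * g                     # chain rule through u^2
    v += lr * 2 * v * g                     # chain rule through -v^2

w_gd = u**2 - v**2
w_l2 = np.linalg.lstsq(X, y, rcond=None)[0] # minimum-l2-norm interpolant, for comparison
print("recovery error:", np.linalg.norm(w_gd - w_star))
print("l1 norms (GD vs min-l2):", np.abs(w_gd).sum(), np.abs(w_l2).sum())

Typically the gradient-descent solution has ℓ1 norm close to that of the sparse ground truth, while the minimum-ℓ2-norm interpolant spreads mass over many coordinates.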
“…In the literature on signal propagation at initialization (e.g. Poole et al., 2016; Hayou et al., 2019; Yang and Schoenholz, 2017; Yang, 2019a; Xiao et al., 2018), results on gradient backpropagation rely on the assumption that the weights used for backpropagation are independent of the ones used for forward propagation. Yang (2020) showed that this assumption yields exact computations of the gradient covariance and NTK in the infinite-width limit.…”
Section: The Equilibrium Hypothesis (mentioning)
Confidence: 99%
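
For context, the independence assumption the quote refers to is what makes the backward second moments computable layer by layer. In the conventions of the signal-propagation literature (the symbols σ_w, σ_b, φ and the exact normalization are illustrative and vary slightly across the cited papers), the forward and backward recursions read roughly

\[
q^{l+1} = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal N(0,\,q^{l})}\!\left[\phi(z)^2\right] + \sigma_b^2,
\qquad
\tilde q^{\,l} = \chi^{l}\,\tilde q^{\,l+1},
\qquad
\chi^{l} = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal N(0,\,q^{l})}\!\left[\phi'(z)^2\right],
\]

where q^l is the variance of a pre-activation at layer l and \tilde q^l the variance of the gradient with respect to it. The backward recursion is precisely the step that uses the forward–backward weight independence, which the quote notes Yang (2020) justified exactly in the infinite-width limit.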
“…Gradient covariance back-propagation. Analytical formulas for gradient covariance back-propagation were derived using this result in (Hayou et al., 2019; Poole et al., 2016; Xiao et al., 2018; Yang, 2019a). Empirical results showed an excellent match for FFNNs in , for ResNets in Yang (2019a), and for CNNs in Xiao et al. (2018).…”
Section: Conclusion and Limitations (mentioning)
Confidence: 99%
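
As a quick numerical illustration of this kind of match (a hedged sketch, not a reproduction of any experiment in the cited papers; the width, depth, σ_w = 1.5, and the tanh nonlinearity are arbitrary choices), one can backpropagate a random gradient through a wide random tanh network using the actual tied weights and compare the per-layer gradient second moment with the recursion sketched above:

import numpy as np

rng = np.random.default_rng(1)
n, L, sw = 2000, 5, 1.5                     # width, depth, sigma_w (illustrative)
phi = np.tanh
dphi = lambda z: 1.0 / np.cosh(z) ** 2      # derivative of tanh

# Forward pass through a random tanh MLP (no biases), storing pre-activations.
h = rng.normal(size=n)
Ws, zs = [], []
for _ in range(L):
    W = rng.normal(size=(n, n)) * sw / np.sqrt(n)
    z = W @ h
    Ws.append(W)
    zs.append(z)
    h = phi(z)

# Backward pass reusing the *same* weights, starting from a random output gradient.
delta = rng.normal(size=n)                  # gradient w.r.t. the last pre-activation
emp = [np.mean(delta ** 2)]
theory = [np.mean(delta ** 2)]
for l in range(L - 1, 0, -1):
    delta = dphi(zs[l - 1]) * (Ws[l].T @ delta)       # actual backprop step with tied weights
    emp.append(np.mean(delta ** 2))
    chi = sw ** 2 * np.mean(dphi(zs[l - 1]) ** 2)     # analytic multiplier chi, estimated from forward pre-activations
    theory.append(theory[-1] * chi)

print(np.c_[emp, theory])  # the two columns should agree up to finite-width fluctuations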
“…We follow the correspondence between NNs and QFT pioneered by Halverson, Maiti and Stoner [1]. Its originality with respect to other approaches lies in the observation that, under very general conditions, NNs with infinitely wide layers are described by a Gaussian process (GP) due to the central limit theorem [38][39][40][41][42][43][44][45][46]. Realistic architectures never involve an infinite number of hyperparameters N, and their behavior fails to be well described by a GP.…”
Section: Introduction and Outline (mentioning)
Confidence: 99%
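
The GP-via-CLT observation in this quote is easy to check numerically. The following sketch (illustrative input dimension, widths, and trial count only; it is not taken from any of the cited references) samples the scalar output of many random one-hidden-layer ReLU networks at a single fixed input and verifies that the excess kurtosis of the output distribution shrinks toward the Gaussian value 0 as the width grows:

import numpy as np

rng = np.random.default_rng(0)
d = 10
x = rng.normal(size=d)                       # one fixed input

def sample_outputs(width, trials=4000):
    """Scalar outputs of `trials` independent random 1-hidden-layer ReLU nets at input x."""
    W = rng.normal(size=(trials, width, d)) / np.sqrt(d)      # hidden weights
    a = rng.normal(size=(trials, width)) / np.sqrt(width)     # readout weights
    h = np.maximum(W @ x, 0.0)                                # hidden activations, shape (trials, width)
    return (a * h).sum(axis=1)                                # one network output per trial

for width in (4, 32, 256):
    f = sample_outputs(width)
    excess_kurt = ((f - f.mean()) ** 4).mean() / f.var() ** 2 - 3.0
    print(width, round(excess_kurt, 3))       # should drift toward 0 (Gaussian) as width grows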