2018
DOI: 10.48550/arxiv.1803.08367
Preprint

Gradient Descent Quantizes ReLU Network Features

Abstract: Deep neural networks are often trained in the over-parametrized regime (i.e. with far more parameters than training examples), and understanding why the training converges to solutions that generalize remains an open problem [Zhang et al., 2017]. Several studies have highlighted the fact that the training procedure, i.e., mini-batch Stochastic Gradient Descent (SGD), leads to solutions that have specific properties in the loss landscape. However, even with plain Gradient Descent (GD) the solutions found in the ov…
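To make the abstract's claim concrete, here is a minimal numerical sketch (not the paper's own experiment; the data, hyperparameters, and the direction-counting heuristic are assumptions chosen for illustration): an over-parametrized two-layer ReLU network is trained with plain full-batch gradient descent from a small initialization, and the contributing hidden weight vectors are then grouped by cosine similarity to estimate how many distinct directions survive.

```python
# A minimal sketch, not the paper's experiment: the data, hyperparameters, and
# direction-counting heuristic below are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 8, 2, 200                        # far fewer samples than hidden units
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]        # arbitrary smooth target

scale = 1e-3                               # small initialization
W = scale * rng.normal(size=(m, d))        # input-to-hidden weights
a = scale * rng.normal(size=m)             # hidden-to-output weights

lr, steps = 0.02, 30_000
for _ in range(steps):
    pre = X @ W.T                          # (n, m) pre-activations
    h = np.maximum(pre, 0.0)               # ReLU
    err = h @ a - y                        # residual of the squared loss
    grad_a = h.T @ err / n
    grad_W = ((err[:, None] * a) * (pre > 0)).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W

# Keep units that actually contribute to the output, then greedily group their
# normalized weight vectors by cosine similarity (0.99 is an arbitrary cutoff).
norms = np.linalg.norm(W, axis=1)
contrib = np.abs(a) * norms
keep = contrib > 1e-2 * contrib.max()
U = W[keep] / norms[keep, None]
directions = []
for u in U:
    if not any(u @ v > 0.99 for v in directions):
        directions.append(u)
print(f"{len(U)} contributing units occupy ~{len(directions)} distinct directions")
```

With a sufficiently small initialization one would expect the printed count to be far below the 200 hidden units, which is the quantization effect the title refers to.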

Cited by 25 publications (41 citation statements) | References 2 publications
“…A trade-off between the scale of the initialization and the training regime is also provided in [WTS+19, SPD+20]. [MBG18] proves that gradient flow forces the weight vectors to concentrate on a small number of directions determined by the input data. Through the lens of spline theory, [PN20b] explains that a number of best practices used in deep learning, such as weight decay and path-norm, are connected to the ReLU activation and its smooth counterparts.…”
Section: Related Work
confidence: 96%
“…For example, the Frequency Principle (Xu et al., 2019, 2020) states that NNs often fit target functions from low to high frequencies during training. A series of works study the mechanism of condensation at the initial training stage, such as for ReLU networks (Maennel et al., 2018; Pellegrini and Biroli, 2020) and for networks with continuously differentiable activation functions (Xu et al., 2021). This work in some sense serves as our attempt to uncover the theoretical structure underlying the condensation phenomenon from the perspective of the loss function, by proving a general Embedding Principle.…”
Section: Related Work
confidence: 99%
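Since the frequency-principle observation quoted above is an empirical statement, a short sketch can illustrate it. The script below (assumed target, width, initialization, and learning rate; not the cited authors' setup) fits a 1D signal containing one low and one high frequency with a two-layer ReLU network and prints the two corresponding Fourier components of the residual during training; the low-frequency component typically shrinks first.

```python
# A hedged illustration of the frequency principle: the target, network width,
# initialization, and learning rate are assumptions, not the cited papers' setup.
import numpy as np

rng = np.random.default_rng(1)

n, m = 128, 200
x = np.linspace(-1.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.3 * np.sin(10 * np.pi * x)   # low + high frequency

W = 0.5 * rng.normal(size=m)     # hidden weights (scalar input)
b = 0.5 * rng.normal(size=m)     # hidden biases
a = np.zeros(m)                  # output weights

def residual_component(err, k):
    """Magnitude of the k-th discrete Fourier component of the residual."""
    return np.abs(np.fft.rfft(err)[k]) / n

lr, steps = 0.01, 20_000
for step in range(steps):
    pre = x[:, None] * W + b                 # (n, m) pre-activations
    mask = pre > 0
    h = np.maximum(pre, 0.0)
    err = h @ a - y
    grad_a = h.T @ err / n
    grad_W = a * ((err * x)[:, None] * mask).sum(axis=0) / n
    grad_b = a * (err[:, None] * mask).sum(axis=0) / n
    a -= lr * grad_a
    W -= lr * grad_W
    b -= lr * grad_b
    if step % 2000 == 0:
        # Bins 2 and 10 correspond to sin(2*pi*x) and sin(10*pi*x) on [-1, 1].
        low, high = residual_component(err, 2), residual_component(err, 10)
        print(f"step {step:5d}: low-freq residual {low:.3f}, high-freq residual {high:.3f}")
```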
“…The frequency principle (Xu et al., 2019, 2020; Rahaman et al., 2019) shows that NNs, over-parameterized or not, tend to fit the training data by a low-frequency function, which suggests that the function learned by an NN is often of much lower complexity than the NN's capacity. Specifically, with small initialization, e.g., in a condensed regime, the weights of an NN are empirically found to condense on isolated directions, resulting in an output function mimicking that of a narrower NN (Maennel et al., 2018). These observations raise the question of in which sense learning in a wide NN is not drastically different from learning in a narrower NN, despite the potentially huge difference in their numbers of parameters.…”
Section: Introduction
confidence: 99%
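One way to read the "mimicking that of a narrower NN" remark quoted above: once the hidden weight vectors of a wide two-layer ReLU net have condensed onto a few directions, merging near-parallel units (summing their rescaled output weights, which is exact by the positive homogeneity of ReLU) yields a much narrower net computing essentially the same function. The sketch below demonstrates this on synthetic condensed weights; the construction and the 0.99 cosine tolerance are assumptions for illustration, not taken from the cited works.

```python
# Illustrative sketch: a wide two-layer ReLU net whose weights have condensed
# onto a few directions behaves like a much narrower net. The synthetic
# "condensed" weights and the merge tolerance are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(2)

def relu_net(X, W, a):
    """Two-layer ReLU net: inputs X (n, d), hidden weights W (m, d), output weights a (m,)."""
    return np.maximum(X @ W.T, 0.0) @ a

def merge_parallel_units(W, a, tol=0.99):
    """Merge hidden units whose directions align (cosine > tol).

    By positive homogeneity, a_j * relu(w_j . x) = (a_j * ||w_j||) * relu(u_j . x)
    with u_j = w_j / ||w_j||, so units sharing a direction just add their coefficients.
    """
    norms = np.linalg.norm(W, axis=1)
    U = W / norms[:, None]
    groups = []                                   # list of [direction, summed coefficient]
    for u, r, a_j in zip(U, norms, a):
        for g in groups:
            if u @ g[0] > tol:
                g[1] += a_j * r
                break
        else:
            groups.append([u, a_j * r])
    W_small = np.stack([g[0] for g in groups])
    a_small = np.array([g[1] for g in groups])
    return W_small, a_small

# Synthetic condensed wide net: 300 units scattered around 3 base directions.
d, m, k = 5, 300, 3
base = rng.normal(size=(k, d))
base /= np.linalg.norm(base, axis=1, keepdims=True)
W_wide = base[rng.integers(k, size=m)] + 1e-3 * rng.normal(size=(m, d))
a_wide = rng.normal(size=m) / m

X = rng.normal(size=(50, d))
W_small, a_small = merge_parallel_units(W_wide, a_wide)
gap = np.max(np.abs(relu_net(X, W_wide, a_wide) - relu_net(X, W_small, a_small)))
print(f"merged {m} units into {len(a_small)}; max prediction gap = {gap:.2e}")
```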
“…Several theoretical works studying neural network training with small initialization can be connected to simplicity bias. Maennel et al. (2018) uncovered a weight quantization effect in training two-layer nets with small initialization: gradient flow biases the weight vectors to a certain number of directions determined by the input data (independent of neural network width). It is hence argued that gradient flow has a bias towards "simple" functions, but their proof is not entirely rigorous and no clear definition of simplicity is given.…”
Section: Related Work
confidence: 99%