2018
DOI: 10.48550/arxiv.1804.10200
Preprint

The loss landscape of overparameterized neural networks

Abstract: We explore some mathematical features of the loss landscape of overparameterized neural networks. A priori one might imagine that the loss function looks like a typical function from R^n to R - in particular, nonconvex, with discrete global minima. In this paper, we prove that in at least one important way, the loss function of an overparameterized neural network does not look like a typical function. If a neural net has n parameters and is trained on d data points, with n > d, we show that the locus M of global minima of L is usually not discrete, but rather an (n - d)-dimensional submanifold of R^n.
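
A minimal numerical sketch of the effect the abstract describes, using an overparameterized linear least-squares model as a stand-in for a neural network (an illustrative assumption, not the paper's construction): with n parameters and d < n data points, the global minimizers of the squared loss form an affine subspace of dimension n - d, the linear analogue of the (n - d)-dimensional submanifold M.

```python
# Toy sketch (illustrative, not from the paper): overparameterized linear
# least squares with n parameters and d < n data points. The set of global
# minimizers of L(theta) = ||X @ theta - y||^2 is an affine subspace of
# dimension n - d whenever X has full row rank.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 4                                   # n parameters, d data points, n > d
X = rng.standard_normal((d, n))                # "data" matrix, full row rank almost surely
y = rng.standard_normal(d)

theta_star = np.linalg.lstsq(X, y, rcond=None)[0]   # one global minimizer (zero loss)
_, _, Vh = np.linalg.svd(X)                         # rows Vh[d:] span the null space of X
null_basis = Vh[d:]                                 # shape (n - d, n)

# Moving theta_star along any null-space direction keeps the loss at its minimum.
v = null_basis.T @ rng.standard_normal(n - d)
assert np.allclose(X @ (theta_star + v), y)

print("dimension of the set of global minima:", n - np.linalg.matrix_rank(X))  # n - d
```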

Cited by 26 publications (27 citation statements) | References: 0 publications

“…Later this phenomenon is explained under generic assumptions by Kuditipudi et al. (2019). Moreover, it has been proved that the local/global minimizers of an overparametrized network form a low-dimensional manifold (Cooper, 2018; 2020) which possibly has many components. Fehrman et al. (2020) proved the convergence rate of SGD to the manifold of local minimizers starting in a small neighborhood.…”
Section: Related Work
confidence: 98%
“…Another practical motivation for studying mode connectivity is to find better optima on the curve or through some ensemble technique. On the theory side, [77] proves that the locus of global minima of an overparameterized NN is a "connected submanifold". Another paper [78] studies a more general property on the connectivity of "sublevel sets" for deep linear NNs and one-hidden-layer ReLU networks.…”
Section: Related Work
confidence: 99%
“…Concerning the training of neural networks via SGD, we refer the reader to [BM11], [JW20]. Related target functions (loss landscapes) are analysed in [Coo18], [Ngu19], [Coo20], [PRV20] and [QZX20].…”
Section: Introduction
confidence: 99%