2016
DOI: 10.48550/arxiv.1605.08361
Preprint

No bad local minima: Data independent training error guarantees for multilayer neural networks

Abstract: We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss, and a single output, under mild over-parametrization. We prove that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more t…
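The abstract's claim (zero training error at every differentiable local minimum of a mildly over-parameterized one-hidden-layer network with piecewise linear activations and quadratic loss) can be illustrated empirically. The sketch below is not the paper's proof or construction; it simply trains a small over-parameterized one-hidden-layer ReLU network with a single output on random data by plain gradient descent and checks that the quadratic training loss becomes negligible. The width, learning rate, and iteration count are illustrative assumptions, not values taken from the paper.

```python
# Empirical sketch (assumed setup, not the paper's analysis): over-parameterized
# one-hidden-layer ReLU network, quadratic loss, single output, random data.
import numpy as np

rng = np.random.default_rng(0)

n, d, h = 50, 10, 200            # samples, input dim, hidden width (h >> n: over-parameterized)
X = rng.standard_normal((n, d))  # "almost every dataset": generic random inputs
y = rng.standard_normal(n)       # generic random targets

W1 = rng.standard_normal((d, h)) / np.sqrt(d)   # hidden-layer weights
w2 = rng.standard_normal(h) / np.sqrt(h)        # output weights (single output)

lr = 1e-2
for step in range(5000):
    Z = X @ W1                   # pre-activations
    A = np.maximum(Z, 0.0)       # ReLU: piecewise linear activation
    pred = A @ w2
    err = pred - y
    loss = 0.5 * np.mean(err ** 2)   # quadratic loss

    # gradients of the quadratic loss (manual backprop)
    g_pred = err / n
    g_w2 = A.T @ g_pred
    g_A = np.outer(g_pred, w2)
    g_Z = g_A * (Z > 0)
    g_W1 = X.T @ g_Z

    W1 -= lr * g_W1
    w2 -= lr * g_w2

# Expected to be close to zero when h >> n, consistent with (but not a proof of)
# the zero-training-error guarantee at differentiable local minima.
print(f"final training loss: {loss:.3e}")
```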

Cited by 88 publications (119 citation statements)
References 13 publications
“…There are multiple recent attempts towards answering the above question and demystifying the success of deep learning. Soudry and Carmon (2016); Safran and Shamir (2016); Arora et al (2018a); Haeffele and Vidal (2015); Nguyen and Hein (2017) showed that over-parameterization can lead to better optimization landscape. Li and Liang (2018); Du et al (2019b) proved that with proper random initialization, gradient descent (GD) and/or stochastic gradient descent (SGD) provably find the global minimum for training over-parameterized one-hidden-layer ReLU networks.…”
Section: Introduction (mentioning)
confidence: 99%
“…In addition, it is possible that all the local maxima are close to optimal. For example, similar observations were made regarding the loss surface of deep neural networks, but the local optima points were shown to be very good in practice [15,11,36], mitigating the issues mentioned above. Thus, we recommend taking Lemma 1 as an observation regarding the optimization landscape of DIAYN which we hope to further explore in future work.…”
Section: Diversity Via Discrimination (mentioning)
confidence: 71%
“…Instead of specific constructions, the following works discussed local minima for one-hidden-layer ReLU networks by analyzing the conditions for their existence. [57] gave the conditions under which a differentiable local minimum has zero loss and thus is global. [32] showed that ReLU networks with hinge loss can only have non-differentiable local minima and gave the conditions for their existence for linear separable data.…”
Section: Related Work (mentioning)
confidence: 99%