2016
DOI: 10.48550/arxiv.1605.08361
Preprint

No bad local minima: Data independent training error guarantees for multilayer neural networks

Abstract: We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss, and a single output, under mild over-parametrization. We prove that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more t…
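The abstract's claim (zero training error at every differentiable local minimum of a mildly over-parameterized one-hidden-layer network with piecewise linear activations and quadratic loss) can be illustrated empirically. The sketch below is not the paper's proof or construction; it simply trains a small over-parameterized one-hidden-layer ReLU network with a single output on random data by plain gradient descent and checks that the quadratic training loss becomes negligible. The width, learning rate, and iteration count are illustrative assumptions, not values taken from the paper.

```python
# Empirical sketch (assumed setup, not the paper's analysis): over-parameterized
# one-hidden-layer ReLU network, quadratic loss, single output, random data.
import numpy as np

rng = np.random.default_rng(0)

n, d, h = 50, 10, 200            # samples, input dim, hidden width (h >> n: over-parameterized)
X = rng.standard_normal((n, d))  # "almost every dataset": generic random inputs
y = rng.standard_normal(n)       # generic random targets

W1 = rng.standard_normal((d, h)) / np.sqrt(d)   # hidden-layer weights
w2 = rng.standard_normal(h) / np.sqrt(h)        # output weights (single output)

lr = 1e-2
for step in range(5000):
    Z = X @ W1                   # pre-activations
    A = np.maximum(Z, 0.0)       # ReLU: piecewise linear activation
    pred = A @ w2
    err = pred - y
    loss = 0.5 * np.mean(err ** 2)   # quadratic loss

    # gradients of the quadratic loss (manual backprop)
    g_pred = err / n
    g_w2 = A.T @ g_pred
    g_A = np.outer(g_pred, w2)
    g_Z = g_A * (Z > 0)
    g_W1 = X.T @ g_Z

    W1 -= lr * g_W1
    w2 -= lr * g_w2

# Expected to be close to zero when h >> n, consistent with (but not a proof of)
# the zero-training-error guarantee at differentiable local minima.
print(f"final training loss: {loss:.3e}")
```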

Cited by 88 publications (119 citation statements)
References 13 publications
“…There are multiple recent attempts towards answering the above question and demystifying the success of deep learning. Soudry and Carmon (2016); Safran and Shamir (2016); Arora et al (2018a); Haeffele and Vidal (2015); Nguyen and Hein (2017) showed that over-parameterization can lead to better optimization landscape. Li and Liang (2018); Du et al (2019b) proved that with proper random initialization, gradient descent (GD) and/or stochastic gradient descent (SGD) provably find the global minimum for training over-parameterized one-hidden-layer ReLU networks.…”
Section: Introduction (mentioning)
confidence: 99%
“…In addition, it is possible that all the local maxima are close to optimal. For example, similar observations were made regarding the loss surface of deep neural networks, but the local optima points were shown to be very good in practice [15,11,36], mitigating the issues mentioned above. Thus, we recommend taking Lemma 1 as an observation regarding the optimization landscape of DIAYN which we hope to further explore in future work.…”
Section: Diversity Via Discrimination (mentioning)
confidence: 71%
“…Instead of specific constructions, the following works discussed local minima for one-hidden-layer ReLU networks by analyzing the conditions for their existence. [57] gave the conditions under which a differentiable local minimum has zero loss and thus is global. [32] showed that ReLU networks with hinge loss can only have non-differentiable local minima and gave the conditions for their existence for linear separable data.…”
Section: Related Work (mentioning)
confidence: 99%