Unique Properties of Flat Minima in Deep Networks
Preprint, 2020
DOI: 10.48550/arxiv.2002.04710
Abstract: It is well known that (stochastic) gradient descent has an implicit bias towards wide minima. In deep neural network training, this mechanism serves to screen out minima. However, the precise effect that this has on the trained network is not yet fully understood. In this paper, we characterize the wide minima in linear neural networks trained with a quadratic loss. First, we show that linear ResNets with zero initialization necessarily converge to the widest of all minima. We then prove that these minima corr…
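The abstract's two claims (zero-initialized linear ResNets reach the widest/flattest minimum, and such minima have balanced per-layer gains) can be illustrated numerically. The sketch below is not code from the paper; it is a scalar simplification of the paper's matrix setting, and the depth, learning rate, data, and the perturbation-based sharpness proxy are all illustrative assumptions.

# Minimal sketch (assumed, not the paper's code): gradient descent on a scalar
# linear ResNet f(x) = (1 + w_L) * ... * (1 + w_1) * x with quadratic loss,
# starting from zero initialization, on a toy regression task.
import numpy as np

rng = np.random.default_rng(0)
L, n = 4, 200                      # depth and number of samples (assumed)
x = rng.normal(size=n)
y = 3.0 * x                        # target linear map with gain 3

w = np.zeros(L)                    # zero initialization, as in the abstract

def predict(w, x):
    return np.prod(1.0 + w) * x

def loss(w):
    return 0.5 * np.mean((predict(w, x) - y) ** 2)

lr = 1e-2
for _ in range(5000):
    g = np.prod(1.0 + w)
    resid = g * x - y
    # d loss / d w_l = mean(resid * x) * prod_{k != l} (1 + w_k)
    grad = np.mean(resid * x) * g / (1.0 + w)
    w -= lr * grad

print("per-layer gains 1 + w_l:", 1.0 + w)   # equal gains: balanced network
print("end-to-end gain:", np.prod(1.0 + w))  # close to the target gain 3

# Crude flatness check (an assumption here, not the paper's definition):
# 2 * (loss(w + eps*d) - loss(w)) / eps^2 for unit d lower-bounds the top
# Hessian eigenvalue at the found minimum.
eps = 1e-3
def rayleigh(d):
    d = d / np.linalg.norm(d)
    return 2.0 * (loss(w + eps * d) - loss(w)) / eps**2
print("sharpness proxy:", max(rayleigh(rng.normal(size=L)) for _ in range(100)))

In this symmetric scalar setting all layers receive identical gradients from the zero initialization, so the printed gains stay exactly equal, which mirrors the balancedness property the abstract describes.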

Cited by 3 publications (2 citation statements)
References 11 publications
“…Our work extends the bulk of literature concerning mathematical characterization of the implicit regularization induced by gradient-based optimization. Existing characterizations focus on different aspects of learning, for example: dynamics of optimization ([1,25,45,6,28,41,26]); curvature ("flatness") of obtained minima ([52]); frequency spectrum of learned input-output mappings ([61]); invariant quantities throughout training ([24]); and statistical properties imported from data ([10]). A ubiquitous approach, arguably more prevalent than the aforementioned, is to demonstrate that learned input-output mappings minimize some notion of norm, or analogously, maximize some notion of margin.…”
Section: Related Work
confidence: 99%
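A standard illustration of the norm-minimization viewpoint mentioned in this excerpt (a textbook fact, included here only as an example, not a result from the cited works): for underdetermined least squares with loss L(w) = \tfrac{1}{2}\|Xw - y\|_2^2, gradient descent initialized at w_0 = 0 keeps every iterate in the row space of X and therefore converges to the minimum-norm interpolant

    w_\infty = \arg\min_{w \,:\, Xw = y} \|w\|_2 .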
“…For the trained NNs on such synthetic data to generalize to real (application) data, the synthetic training set has to include the features embedded in the real data as much as possible (Kouw, 2018). For one, the training dataset (inputs and labels) should be represented by distributions that include the input and expected labels that lead the training to non-sharp (flat) local minima for the real data (Mulayoff and Michaeli, 2020). However, this requirement, especially with respect to the input data to the network, is hard to achieve considering the simplified assumptions we use in modeling and simulation.…”
Section: Introduction
confidence: 99%
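One concrete (hypothetical) way to probe the requirement described in this excerpt is to measure how much a loss computed on real data increases when the parameters found on synthetic data are slightly perturbed. The sketch below is an illustration under stated assumptions: the name flatness_gap, the perturbation radius, and the use of a mean gap are choices made here, not part of the cited works.

# Illustrative sketch (assumed): perturbation-based flatness probe of trained
# parameters theta against a real-data loss function.
import numpy as np

def flatness_gap(loss_real, theta, eps=1e-3, trials=50, seed=0):
    """Mean increase of a real-data loss under random perturbations of
    radius eps around trained parameters theta (small value => flatter)."""
    rng = np.random.default_rng(seed)
    base = loss_real(theta)
    gaps = []
    for _ in range(trials):
        d = rng.normal(size=theta.shape)
        d *= eps / np.linalg.norm(d)      # fixed-radius random direction
        gaps.append(loss_real(theta + d) - base)
    return float(np.mean(gaps))

Such a perturbation test is only a crude stand-in for Hessian-based flatness measures, but it makes the requirement operational: the minimum should be flat with respect to the real data, not only the synthetic training data.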