On the diffusion approximation of nonconvex stochastic gradient descent
Preprint, 2017
DOI: 10.48550/arxiv.1705.07562

Cited by 27 publications (38 citation statements)
References 14 publications

Citation statements:
“…A popular approach for analyzing SGD is based on considering SGD as a discretization of a continuous-time process [9, 22, 25, 33, 36, 61]. This approach mainly requires the following Assumption 1 on the stochastic gradient noise $U_k(w) := \nabla \hat{f}_k(w) - \nabla f(w)$:…”
Section: Introduction (mentioning)
confidence: 99%
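For concreteness, a minimal sketch of the diffusion approximation these works appeal to; the noise covariance $\Sigma(w)$, step size $\eta$, and Brownian motion $B_t$ below are part of the standard setup and are assumptions here, not quoted from the excerpt. Writing the SGD iterate as $w_{k+1} = w_k - \eta \nabla f(w_k) - \eta U_k(w_k)$ and assuming $U_k(w)$ has mean zero and covariance $\Sigma(w)$, the iterates are commonly modeled (to first order in $\eta$) by the SDE

\[
  \mathrm{d}W_t \;=\; -\nabla f(W_t)\,\mathrm{d}t \;+\; \sqrt{\eta\,\Sigma(W_t)}\;\mathrm{d}B_t,
  \qquad t \approx \eta k,
\]

where $B_t$ is a standard Brownian motion.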
“…Considering (9), if these distributions are different, then the change in generalization error would become $\eta \operatorname{tr}\Sigma_{\mathrm{train}}/N + \eta\, G_{\mathrm{train}} \cdot (G_{\mathrm{test}} - G_{\mathrm{train}})$, and thus has an additional piece that depends on the difference between test and train mean gradients. Since the second term is still present, this doesn't falsify our result (13), but shows that this divergence of gradients is also an important factor in overfitting.…”
Section: Various Comments (mentioning)
confidence: 45%
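Spelled out as a display, with symbol readings inferred from context rather than stated in the excerpt ($\eta$ the step size, $N$ the number of training samples, $\Sigma_{\mathrm{train}}$ the gradient-noise covariance on the training set, $G_{\mathrm{train}}$ and $G_{\mathrm{test}}$ the mean train and test gradients):

\[
  \Delta(\text{generalization error}) \;\approx\;
  \frac{\eta}{N}\,\operatorname{tr}\Sigma_{\mathrm{train}}
  \;+\; \eta\, G_{\mathrm{train}} \cdot \bigl(G_{\mathrm{test}} - G_{\mathrm{train}}\bigr).
\]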
“…The prevailing explanation given for this empirical observation has to do with descriptions of the loss landscape [8] (also see e.g. [9, 11-15]), in particular that SGD prefers "flat" to "sharp" minima in which the loss doesn't change too much in the neighborhood of the minima. An argument originating from [16] ascribes good generalization properties to such flat minima, reasoning that the effect of changing inputs could be reinterpreted as shifting the location of the minima.…”
Section: Introduction (mentioning)
confidence: 99%
“…A prolific technique for analyzing optimization methods, both stochastic and deterministic, is the stochastic differential equation (SDE) paradigm [Chaudhari and Soatto, 2018; Hu et al., 2017; Jastrzebski et al., 2017; Kushner and Yin, 2003; Ljung, 1977; Mandt et al., 2016; Su et al., 2016]. These SDEs relate to the dynamics of the optimization method by taking the limit as the stepsize goes to zero, so that the trajectory of the objective function over the lifetime of the algorithm converges to the solution of an SDE.…”
Section: Introduction (mentioning)
confidence: 99%
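To make the vanishing-stepsize limit concrete, here is an illustrative sketch (not taken from any of the cited papers; the quadratic objective, eta, sigma, and n_steps are assumptions of this toy example) comparing SGD on a one-dimensional quadratic with an Euler-Maruyama discretization of the corresponding SDE:

import numpy as np

# Toy problem: f(w) = w^2 / 2, so grad f(w) = w, with additive Gaussian
# gradient noise of standard deviation sigma (a stand-in for U_k(w)).
rng = np.random.default_rng(0)
eta = 0.01        # SGD step size, also used as the SDE time step dt
sigma = 1.0       # std. dev. of the stochastic gradient noise
n_steps = 5000
w_sgd = w_sde = 2.0

sgd_path, sde_path = [], []
for _ in range(n_steps):
    # SGD: w_{k+1} = w_k - eta * (grad f(w_k) + noise_k)
    w_sgd = w_sgd - eta * (w_sgd + sigma * rng.standard_normal())
    sgd_path.append(w_sgd)

    # Euler-Maruyama step of dW = -grad f(W) dt + sqrt(eta) * sigma dB,
    # taking dt = eta so one step corresponds to one SGD iteration.
    w_sde = w_sde - w_sde * eta + np.sqrt(eta) * sigma * np.sqrt(eta) * rng.standard_normal()
    sde_path.append(w_sde)

# After the transient dies out, both processes fluctuate around 0 with a
# standard deviation of roughly sigma * sqrt(eta / 2) for this quadratic.
print("SGD tail std:", np.std(sgd_path[n_steps // 2:]))
print("SDE tail std:", np.std(sde_path[n_steps // 2:]))

On this toy problem the two tail standard deviations come out close to each other, which is the sense in which the SDE tracks the long-run behavior of SGD for small stepsizes.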