2019
DOI: 10.4310/amsa.2019.v4.n1.a1
On the diffusion approximation of nonconvex stochastic gradient descent

Abstract: We study the Stochastic Gradient Descent (SGD) method in nonconvex optimization problems from the point of view of approximating diffusion processes. We prove rigorously that the diffusion process can approximate the SGD algorithm weakly using the weak form of the master equation for probability evolution. In the small step size regime and the presence of omnidirectional noise, our weak approximating diffusion process suggests the following dynamics for the SGD iteration starting from a local minimizer (resp. sadd…
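The abstract's small-step-size picture can be illustrated with a short numerical sketch. The code below is not from the paper; the toy objective, noise scale, and step size are illustrative assumptions. It runs SGD with additive gradient noise on a one-dimensional nonconvex objective with local minimizers at ±1 and a local maximizer at 0 (a one-dimensional stand-in for a saddle point), alongside an Euler–Maruyama simulation of the candidate diffusion dX_t = -f'(X_t) dt + √η σ dW_t over the same physical time.

# Minimal sketch (not the paper's code): compares noisy-gradient SGD on a toy
# nonconvex 1-D objective with an Euler-Maruyama simulation of a candidate
# diffusion approximation  dX_t = -f'(X_t) dt + sqrt(eta) * sigma dW_t.
# The objective, noise model, and step size below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Gradient of the toy objective f(x) = (x**2 - 1)**2 / 4, which has
    # local minimizers at x = -1, 1 and a local maximizer at x = 0.
    return x * (x**2 - 1)

eta = 1e-2          # step size (learning rate)
sigma = 0.5         # scale of the zero-mean gradient noise (assumed constant)
n_steps = 5000

# SGD: full gradient plus zero-mean Gaussian noise at each step.
x_sgd = 1.0
for _ in range(n_steps):
    noisy_grad = grad_f(x_sgd) + sigma * rng.standard_normal()
    x_sgd -= eta * noisy_grad

# Euler-Maruyama discretization of the approximating diffusion, run for the
# same physical time T = n_steps * eta with time step dt = eta, so the noise
# increment sqrt(eta)*sigma*dW matches the eta*sigma*xi noise of one SGD step.
x_sde = 1.0
for _ in range(n_steps):
    dW = np.sqrt(eta) * rng.standard_normal()
    x_sde += -grad_f(x_sde) * eta + np.sqrt(eta) * sigma * dW

print(f"SGD iterate: {x_sgd:.3f}, diffusion sample: {x_sde:.3f}")

Shrinking η further brings the laws of the two processes closer in the weak sense described in the abstract.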

Cited by 55 publications (53 citation statements)
References 12 publications
“…RBM-1 is in spirit similar to stochastic gradient descent in machine learning ([21,22]). Recently, there have been analyses of SGD from a mathematical viewpoint and applications to physical problems [44,45,46,47]. RBM-1 can be used both for simulating the evolution of the measure (1.5) or (1.4) and for sampling from the equilibrium state π.…”
mentioning
confidence: 99%
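As a rough illustration of the analogy drawn above between RBM-1 and SGD, the sketch below implements one generic random-batch time step for an interacting particle system: particles are shuffled into small batches and pairwise forces are evaluated only within each batch, mirroring how SGD replaces a full-data gradient with a minibatch estimate. The kernel, batch size p, noise temperature, and dynamics are illustrative assumptions, not the equations (1.4)/(1.5) of the cited work.

# Minimal sketch (illustrative, not the cited paper's scheme): one time step of
# a random-batch style update for N interacting particles. Instead of summing
# the interaction force over all N-1 partners (O(N^2) per step), particles are
# shuffled into small batches and interact only within their batch, in the same
# spirit as minibatching in SGD. Kernel, batch size p, and noise are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def kernel(x, y):
    # Illustrative smooth pairwise force acting on particle x due to particle y.
    return -(x - y) / (1.0 + (x - y) ** 2)

def random_batch_step(X, dt, p=2, temperature=0.1):
    """Advance all particles one step, with forces restricted to random batches."""
    N = len(X)
    idx = rng.permutation(N)
    X_new = X.copy()
    for start in range(0, N, p):
        batch = idx[start:start + p]
        for i in batch:
            force = sum(kernel(X[i], X[j]) for j in batch if j != i)
            # Rescale so the batch force stands in for the full pairwise sum.
            force *= (N - 1) / max(len(batch) - 1, 1)
            X_new[i] = X[i] + dt * force + np.sqrt(2 * temperature * dt) * rng.standard_normal()
    return X_new

X = rng.standard_normal(50)      # 50 particles in one dimension
for _ in range(1000):
    X = random_batch_step(X, dt=0.01)
print("empirical mean/std after 1000 steps:", X.mean(), X.std())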
“…approximates the SGD iteration with weak error O(η²), where W is a Wiener process, ‖·‖ denotes the Euclidean 2-norm, and where BBᵀ = Σ. Dropping the small term proportional to η reduces the weak error to O(η) (Hu et al., 2017). This leads to the SDE…”
Section: Stochastic Gradient Descent in Continuous-Time
mentioning
confidence: 99%
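The quoted statement contrasts a second-order-accurate diffusion (weak error O(η²)) with the simpler SDE obtained by dropping a drift term proportional to η (weak error O(η)). The sketch below simulates both variants by Euler–Maruyama on a toy quadratic objective, using a drift correction of the form -(η/4)∇‖∇f‖² as in the stochastic modified equations literature; the objective, the noise covariance Σ, and the choice B = chol(Σ) (so that BBᵀ = Σ) are assumptions made here for illustration, not the cited paper's setup.

# Minimal sketch (assumptions noted below): Euler-Maruyama simulation of two
# candidate diffusion approximations of SGD on a 2-D quadratic toy objective.
# The first uses drift -grad f only (weak error O(eta), as in the quoted
# statement once the eta-proportional term is dropped); the second adds an
# eta-proportional drift correction -(eta/4) * grad ||grad f||^2 of the form
# found in the stochastic modified equations literature. The objective f, the
# covariance Sigma, and B = chol(Sigma) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
eta = 0.05
T = 10.0
n_steps = int(T / eta)

A = np.array([[3.0, 0.5], [0.5, 1.0]])      # f(x) = 0.5 * x^T A x (toy objective)
Sigma = np.array([[0.4, 0.1], [0.1, 0.2]])  # assumed gradient-noise covariance
B = np.linalg.cholesky(Sigma)               # any B with B @ B.T == Sigma works

def grad_f(x):
    return A @ x

def corrected_drift(x):
    # -grad( f + (eta/4) * ||grad f||^2 ) for the quadratic toy objective,
    # using grad ||grad f||^2 = 2 * A^T A x.
    return -(grad_f(x) + (eta / 2) * (A.T @ A) @ x)

def simulate(drift):
    x = np.array([2.0, -1.0])
    for _ in range(n_steps):
        dW = np.sqrt(eta) * rng.standard_normal(2)      # Brownian increment over dt = eta
        x = x + drift(x) * eta + np.sqrt(eta) * (B @ dW)
    return x

print("first-order SDE endpoint: ", simulate(lambda x: -grad_f(x)))
print("with eta-correction term:", simulate(corrected_drift))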
“…In the context of data science, diffusion approximation has been used to gain insights into online PCA [20], entropy-SGD [36,37], and nonconvex optimization [21], to name just a few. Despite its effectiveness as a continuous analogy of stochastic numerical optimization algorithms, the range of applicability of diffusion approximation is significantly limited by its restricted validity in a finite time interval.…”
Section: Main Contribution: Long-Time Weak Approximation for SGD via SDE
mentioning
confidence: 99%
“…Stochastic gradient descent (SGD) is a prototypical stochastic optimization algorithm widely used for solving large scale data science problems [1,2,3,4,5,6], not only for its scalability to large datasets, but also due to its surprising capability of identifying parameters of deep neural network models with better generalization behavior than adaptive gradient methods [7,8,9]. The past decade has witnessed growing interest in accelerating this simple yet powerful optimization scheme [10,11,12,13,14,15], as well as better understanding its dynamics, through the lens of either discrete Markov chains [16,17] or continuous stochastic differential equations [18,19,20,21]. This paper introduces new techniques into the theoretical framework of diffusion approximation, which provides weak approximation to SGD algorithms through the solution of a modified stochastic differential equation (SDE).…”
Section: Introduction
mentioning
confidence: 99%