2021
DOI: 10.1137/19m1263443

Convergence and Dynamical Behavior of the ADAM Algorithm for Nonconvex Stochastic Optimization

Abstract: Adam is a popular variant of stochastic gradient descent for finding a local minimizer of a function. The objective function is unknown, but a random estimate of the current gradient vector is observed at each round of the algorithm. Assuming that the objective function is differentiable and non-convex, we establish the convergence in the long run of the iterates to a stationary point. The key ingredient is the introduction of a continuous-time version of Adam, in the form of a non-autonomous ordinary differential equation […]
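For context, the discrete recursion that this continuous-time analysis refers to is the standard Adam update of Kingma and Ba; the restatement below uses the usual notation (step size α, momentum parameters β₁, β₂, stabilizer ε, noisy gradient g) and is not quoted from the abstract.

```latex
% Standard Adam recursion (Kingma & Ba), restated for context; requires amsmath.
\[
\begin{aligned}
m_{n+1} &= \beta_1 m_n + (1-\beta_1)\, g_{n+1}, \\
v_{n+1} &= \beta_2 v_n + (1-\beta_2)\, g_{n+1}^{\odot 2}, \\
\hat m_{n+1} &= \frac{m_{n+1}}{1-\beta_1^{\,n+1}}, \qquad
\hat v_{n+1} = \frac{v_{n+1}}{1-\beta_2^{\,n+1}}, \\
x_{n+1} &= x_n - \alpha\, \frac{\hat m_{n+1}}{\varepsilon + \sqrt{\hat v_{n+1}}}.
\end{aligned}
\]
```

The paper's non-autonomous ODE is presented as a continuous-time counterpart of this recursion.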

Cited by 52 publications (45 citation statements)
References 20 publications (26 reference statements)
“…In the experiment, all weights are initialized randomly from an N(0, 1) Gaussian distribution. Given the graphics-card memory constraints, batch_size is set to 1. The initial learning rate is set to 0.001, using the Adam optimization algorithm [30]. The Adam optimization algorithm computes first- and second-order moment estimates of the gradient to adapt the learning rate.…”
Section: Experimental Setup and Environment
mentioning, confidence: 99%
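To make the quoted setup concrete, here is a minimal sketch assuming PyTorch; the model, data, and loss below are placeholder assumptions for illustration, not details from the citing paper.

```python
# Minimal sketch of the quoted setup, assuming PyTorch; model and data are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)           # hypothetical model standing in for the cited network
for p in model.parameters():         # weights drawn randomly from N(0, 1)
    nn.init.normal_(p, mean=0.0, std=1.0)

# Adam adapts the step size from first- and second-moment estimates of the gradient.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(32, 128), torch.randint(0, 10, (32,))),
    batch_size=1,                    # batch_size = 1, as dictated by memory in the quote
)

loss_fn = nn.CrossEntropyLoss()      # illustrative loss, not specified in the quote
for x, y in loader:
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```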
“…The idea of the adaptive step size is taken from the Adagrad algorithm introduced in [26]. Analysis of such algorithms for nonconvex objectives was proposed in [36,55,3,25] in the stochastic and smooth setting. To our knowledge, the combination of adaptive step sizes with incremental methods has not been considered.…”
Section: Relation to Existing Literature
mentioning, confidence: 99%
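As a reminder of the Adagrad-style adaptive step size mentioned in the quote, here is a minimal NumPy sketch; the toy objective and constants are illustrative assumptions, not taken from [26] or the citing paper.

```python
# Adagrad-style adaptive step size on a toy objective f(x) = 0.5 * ||x||^2 (illustrative).
import numpy as np

def grad(x):                    # gradient of the toy objective
    return x

x = np.array([1.0, -2.0])
accum = np.zeros_like(x)        # running sum of squared gradients
alpha, eps = 0.5, 1e-8

for _ in range(100):
    g = grad(x)
    accum += g ** 2                               # coordinate-wise accumulation
    x -= alpha * g / (np.sqrt(accum) + eps)       # per-coordinate adaptive step
```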
“…Our nonsmooth convergence analysis relies on the ODE method, see [37] and many subsequent developments [6,33,7,17,3]. In particular, we build upon a nonsmooth ODE formulation via differential inclusions [22,2].…”
Section: Relation to Existing Literature
mentioning, confidence: 99%
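For readers unfamiliar with the ODE method invoked here, the schematic below states the generic idea: a stochastic recursion whose interpolated iterates asymptotically track a differential inclusion when the drift is set-valued (e.g. a Clarke subdifferential). The notation is generic and not taken from the quoted references.

```latex
% Generic ODE-method / differential-inclusion viewpoint (illustrative notation only).
\[
x_{k+1} = x_k - \gamma_{k+1}\bigl(y_{k+1} + \xi_{k+1}\bigr),
\quad y_{k+1} \in \partial F(x_k),
\qquad\text{with interpolated iterates tracking}\qquad
\dot x(t) \in -\,\partial F(x(t)).
\]
```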
“…Our starting point is a generic non-autonomous Ordinary Differential Equation (ODE) introduced by Belotto da Silva and Gazeau [9] (see also [8] for Adam), depicting the continuous-time versions of the aforementioned florilegium of algorithms. The solutions to the ODE are shown to converge to the set of critical points of F .…”
Section: Introduction
mentioning, confidence: 99%
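A simplified schematic of such a non-autonomous continuous-time system for Adam might read as follows; it omits the time-dependent bias-correction factors, so it is only an illustration and not the exact ODE of [8] or [9].

```latex
% Schematic continuous-time Adam dynamics; the exact non-autonomous systems in [8]/[9]
% carry additional time-dependent bias-correction factors omitted here.
\[
\begin{aligned}
\dot m(t) &= a\,\bigl(\nabla F(x(t)) - m(t)\bigr), \\
\dot v(t) &= b\,\bigl(\nabla F(x(t))^{\odot 2} - v(t)\bigr), \\
\dot x(t) &= -\,\frac{m(t)}{\varepsilon + \sqrt{v(t)}}.
\end{aligned}
\]
```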