2020
DOI: 10.48550/arxiv.2007.14294
Preprint

A High Probability Analysis of Adaptive SGD with Momentum

Xiaoyu Li, Francesco Orabona

Abstract: Stochastic Gradient Descent (SGD) and its variants are the most used algorithms in machine learning applications. In particular, SGD with adaptive learning rates and momentum is the industry standard to train deep networks. Despite the enormous success of these methods, our theoretical understanding of these variants in the nonconvex setting is not complete, with most of the results only proving convergence in expectation and with strong assumptions on the stochastic gradients. In this paper, we present a high…
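The abstract describes SGD with an adaptive learning rate and momentum. As a rough, generic illustration only (not the exact update analyzed in the paper), the sketch below combines an AdaGrad-norm style adaptive stepsize with an exponential momentum average; the function name adaptive_sgd_momentum and all hyperparameter defaults are hypothetical.

import numpy as np

def adaptive_sgd_momentum(grad_fn, x0, steps=1000, eta=0.1, beta=0.9, eps=1e-8):
    # Generic sketch: SGD with an AdaGrad-norm style adaptive stepsize and momentum.
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)   # momentum buffer
    v = 0.0                # running sum of squared stochastic-gradient norms
    for _ in range(steps):
        g = grad_fn(x)                        # stochastic gradient at the current iterate
        v += float(np.dot(g, g))              # accumulate squared gradient norms
        m = beta * m + (1.0 - beta) * g       # exponential moving average of gradients
        x = x - (eta / np.sqrt(v + eps)) * m  # adaptive stepsize applied to the momentum
    return x

# Hypothetical usage: noisy gradient of f(x) = 0.5 * ||x||^2
rng = np.random.default_rng(0)
x_final = adaptive_sgd_momentum(lambda x: x + 0.1 * rng.standard_normal(x.shape), x0=np.ones(10))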

Cited by 7 publications (11 citation statements)
References 6 publications
“…where in the last inequality we used (21). From (22) and (20), with probability at least 1 − δ₁ − δ₂, we obtain…”
Section: Proof of Theorem
confidence: 99%
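Equations (20)–(22) referenced in this snippet are not reproduced on this page; the stated failure probability is the standard union-bound combination of two high-probability events, sketched here under that assumption:

If $\Pr[A] \ge 1 - \delta_1$ and $\Pr[B] \ge 1 - \delta_2$, then
$$\Pr[A \cap B] = 1 - \Pr[A^c \cup B^c] \ge 1 - \Pr[A^c] - \Pr[B^c] \ge 1 - \delta_1 - \delta_2.$$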
“…[32], [33], and [21] gave high-probability convergence guarantees under Assumption 4. Intuitively, this means that the tails of the noise distribution are dominated by the tails of a Gaussian distribution.…”
Section: Convergence Analysis with High Probability
confidence: 99%
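The exact statement of "Assumption 4" is not quoted above; a commonly used form of a norm sub-Gaussian noise assumption, matching the "tails dominated by a Gaussian" intuition, is the following (the symbols $\xi_t$ and $\sigma$ are illustrative):

For the stochastic-gradient noise $\xi_t = g_t - \nabla f(x_t)$, there exists $\sigma > 0$ such that
$$\Pr\big[\|\xi_t\| \ge s\big] \le 2\exp\!\left(-\tfrac{s^2}{2\sigma^2}\right) \quad \text{for all } s \ge 0,$$
or, equivalently up to constants, $\mathbb{E}\big[\exp(\|\xi_t\|^2/\sigma^2)\big] \le e$.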
“…In particular, almost sure convergence to a first-order stationary point is proved assuming only strong smoothness and a weak assumption on the noise in [35]; mean convergence under the PL inequality is shown in, e.g., [37]. High-probability convergence results assuming strong smoothness and norm sub-Gaussian noise were provided in, e.g., [38], and in [39] for strongly convex functions in the non-smooth setting.…”
Section: Introduction
confidence: 99%
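For reference, the Polyak–Łojasiewicz (PL) inequality mentioned in this snippet is commonly stated as follows (the exact form used in [37] may differ): a differentiable $f$ with infimum $f^\star$ satisfies the PL condition with constant $\mu > 0$ if
$$\tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,\big(f(x) - f^\star\big) \quad \text{for all } x,$$
which yields linear convergence in function value for gradient-type methods without requiring convexity.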
“…We also acknowledge representative prior works on inexact gradient and proximal-gradient methods for batch optimization in, e.g., [32]–[34], and for stochastic gradient descent in [35]–[38] (see also references therein). In particular, almost sure convergence to a first-order stationary point is proved assuming only strong smoothness and a weak assumption on the noise in [35]; mean convergence under the PL inequality is shown in, e.g., [37].…”
Section: Introduction
confidence: 99%
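The inexact proximal-gradient methods surveyed in [32]–[34] are not specified on this page; a generic form of one such step for a composite objective $f + g$, with a hypothetical gradient error $e_k$ and stepsize $\gamma$, is
$$x_{k+1} = \operatorname{prox}_{\gamma g}\!\big(x_k - \gamma\,(\nabla f(x_k) + e_k)\big), \qquad \operatorname{prox}_{\gamma g}(y) = \arg\min_{x}\,\Big\{ g(x) + \tfrac{1}{2\gamma}\|x - y\|^2 \Big\}.$$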