2018 Information Theory and Applications Workshop (ITA)
DOI: 10.1109/ita.2018.8503173

On the Insufficiency of Existing Momentum Schemes for Stochastic Optimization

Abstract: Momentum based stochastic gradient methods such as heavy ball (HB) and Nesterov's accelerated gradient descent (NAG) method are widely used in practice for training deep networks and other supervised learning models, as they often provide significant improvements over stochastic gradient descent (SGD). Rigorously speaking, "fast gradient" methods have provable improvements over gradient descent only for the deterministic case, where the gradients are exact. In the stochastic case, the popular explanations for …

Cited by 65 publications (93 citation statements: 2 supporting, 91 mentioning, 0 contrasting)
References 20 publications
“…As we show in this work, both theoretically and empirically, (small batch size) SGD with heavy-ball momentum (sHB) for any fixed momentum parameter does not provide any acceleration over plain SGD on large-scale least-squares problems. We conclude, under an identification of the parameters, that f(x_k^sHB) = f(x_k^SGD) for all k, up to errors that vanish as n grows large (upper bounds of this nature have been observed before: see Kidambi et al. [2018], Sebbouh et al. [2020], Zhang et al. [2019]). Thus, while sHB may provide a speed-up over SGD, it is only due to an effective increase in the learning rate, and this speed-up could be matched by appropriately adjusting the learning rate of SGD.…”
supporting
confidence: 58%
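To make the comparison above concrete, here is a minimal sketch, assuming a synthetic least-squares problem with illustrative sizes, step sizes, and momentum value (none of this is taken from the cited papers): stochastic heavy-ball (sHB) with a fixed momentum parameter beta behaves much like plain SGD run with the larger effective learning rate lr / (1 - beta).

```python
import numpy as np

# Synthetic least-squares problem (illustrative sizes and noise level).
rng = np.random.default_rng(0)
n, d = 2000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)   # objective used for comparison

def stoch_grad(x):
    i = rng.integers(n)                          # single-sample (batch size 1) gradient
    return (A[i] @ x - b[i]) * A[i]

def sgd(lr, steps=20000):
    x = np.zeros(d)
    for _ in range(steps):
        x -= lr * stoch_grad(x)
    return x

def shb(lr, beta, steps=20000):
    x, v = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        v = beta * v + stoch_grad(x)             # heavy-ball momentum buffer
        x -= lr * v
    return x

lr, beta = 0.002, 0.9
# sHB with step lr roughly matches SGD with effective step lr / (1 - beta):
print("sHB :", f(shb(lr, beta)))
print("SGD :", f(sgd(lr / (1 - beta))))
```

Running both for the same number of stochastic gradient evaluations typically yields comparable objective values, consistent with the "effective learning rate" interpretation in the statement quoted above.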
“…The variance of stochastic gradients is detrimental to SGD, motivating variance-reduction techniques [22][23][24][25][26][27][28] that aim to reduce the variance incurred by stochastic gradient estimation and to improve the convergence rate, mainly for convex optimization, while some are extended to non-convex problems [29][30][31]. Among the most practical algorithms for better convergence rates are momentum [32], modified momentum for accelerated gradient [33], and stochastic estimation of accelerated gradient descent [34]. These algorithms focus more on convergence efficiency than on the generalization of models for accuracy.…”
Section: Variance Reduction
mentioning
confidence: 99%
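As an illustration of the variance-reduction idea referenced above, the following is a minimal SVRG-style sketch on a synthetic least-squares problem; it is a generic textbook variant under assumed hyperparameters, not any of the specific algorithms cited.

```python
import numpy as np

# Synthetic least-squares problem (illustrative sizes).
rng = np.random.default_rng(1)
n, d = 2000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

def svrg(lr=0.02, epochs=20):
    x = np.zeros(d)
    for _ in range(epochs):
        x_snap = x.copy()
        full_grad = A.T @ (A @ x_snap - b) / n         # full gradient at the snapshot
        for _ in range(n):
            i = rng.integers(n)
            g_cur = (A[i] @ x - b[i]) * A[i]           # stochastic gradient at the iterate
            g_snap = (A[i] @ x_snap - b[i]) * A[i]     # same sample at the snapshot
            x -= lr * (g_cur - g_snap + full_grad)     # variance-reduced update
    return x

print("SVRG:", f(svrg()))
```

The correction term g_cur - g_snap + full_grad is an unbiased gradient estimate whose variance shrinks as the iterate approaches the snapshot, which is the mechanism behind the improved convergence rates mentioned above.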
“…The inclusion of Nesterov's momentum in the deterministic step offers potential stabilization for stochastic variance-reduced algorithms. However, the acceleration gained with Nesterov's momentum in stochastic steps is mainly a by-product of mini-batching [28]. This double acceleration in DASVRDA [24] is possibly limited to the mini-batch setting, which prevents us from developing a stochastic algorithm in the non-mini-batch setting.…”
Section: B. Momentum Acceleration
mentioning
confidence: 99%
“…Momentum methods are popular and widely applicable in the stochastic setting because they mimic their deterministic counterparts, which yields some practical gains. However, these gains in the stochastic case are largely by-products of mini-batching [28]. For instance, Nesterov's momentum is applied in the pure stochastic setting with a batch size of 1 in Acc-Prox-SVRG [18], wherein no acceleration can be guaranteed in theory.…”
Section: A. Double Acceleration
mentioning
confidence: 99%
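For concreteness, here is a minimal sketch of Nesterov-style momentum driven by batch-size-1 stochastic gradients, the regime the statements above describe as offering no guaranteed acceleration. The problem setup, step size, and momentum value are assumptions for illustration; this is not the Acc-Prox-SVRG algorithm itself.

```python
import numpy as np

# Synthetic least-squares problem (illustrative sizes).
rng = np.random.default_rng(2)
n, d = 2000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

def stoch_grad(x):
    i = rng.integers(n)                        # batch size 1
    return (A[i] @ x - b[i]) * A[i]

def nag_stochastic(lr=0.002, beta=0.9, steps=20000):
    """Nesterov momentum with stochastic gradients (Sutskever-style formulation)."""
    x, v = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        g = stoch_grad(x + beta * v)           # gradient at the look-ahead point
        v = beta * v - lr * g                  # velocity update
        x = x + v
    return x

print("stochastic NAG:", f(nag_stochastic()))
```

With a batch size of 1, the look-ahead gradient is as noisy as a plain SGD gradient, which is the intuition behind the claim that the observed acceleration is largely a by-product of mini-batching.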