On the Insufficiency of Existing Momentum Schemes for Stochastic Optimization

Kidambi, Rahul; Netrapalli, Praneeth; Jain, Prateek; Kakade, Sham M.

doi:10.1109/ita.2018.8503173

Cited by 65 publications

(93 citation statements)

References 20 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…As we show in this work, both theoretically and empirically, (small batch size) SGD with heavyball momentum (sHB) for any fixed momentum parameter does not provide any acceleration over plain SGD on large-scale least square problems. We conclude under an identification of the parameters that f (x shb k ) = f (x sgd k ) for all k up to errors that vanish as n grows large (upper bounds of this nature have been observed before: see Kidambi et al [2018], Sebbouh et al [2020], Zhang et al [2019]). Thus while sHB may provide a speed-up over SGD, it is only due to an effective increase in the learning rate, and this speed-up could be matched by appropriately adjusting the learning rate of SGD.…”

supporting

confidence: 58%

Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models

Paquette¹,

Paquette²

2021

Preprint

View full text Add to dashboard Cite

We analyze a class of stochastic gradient algorithms with momentum on a highdimensional random least squares problem. Our framework, inspired by random matrix theory, provides an exact (deterministic) characterization for the sequence of loss values produced by these algorithms which is expressed only in terms of the eigenvalues of the Hessian. This leads to simple expressions for nearly-optimal hyperparameters, a description of the limiting neighborhood, and average-case complexity. As a consequence, we show that (small-batch) stochastic heavy-ball momentum with a fixed momentum parameter provides no actual performance improvement over SGD when step sizes are adjusted correctly. For contrast, in the non-strongly convex setting, it is possible to get a large improvement over SGD using momentum. By introducing hyperparameters that depend on the number of samples, we propose a new algorithm sDANA (stochastic dimension adjusted Nesterov acceleration) which obtains an asymptotically optimal average-case complexity while remaining linearly convergent in the strongly convex setting without adjusting parameters.Methods that incorporate momentum and acceleration play an integral role in machine learning where they are often combined with stochastic gradients. Two of the most popular methods in this category are the heavy-ball method (HB) [Polyak, 1964] and Nesterov's accelerated method (NAG) [Nesterov, 2004]. These methods are known to achieve optimal convergence guarantees when employed with exact gradients (computed on the full training data set), but in practice, these momentum methods are typically implemented with stochastic gradients. In the influential work Sutskever et al. [2013], the authors demonstrated empirical advantages of augmenting stochastic gradient descent (SGD) with the momentum machinery and, as a result, momentum methods are widely used for training deep neural networks. Yet despite the popularity of these stochastic momentum methods, the theoretical understanding of these algorithms remains rather limited.

show abstract

supporting

confidence: 58%

Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models

Paquette¹,

Paquette²

2021

Preprint

View full text Add to dashboard Cite

show abstract

“…The variance of stochastic gradients is detrimental to SGD, motivating variance reduction techniques [22][23][24][25][26][27][28] that aim to reduce the variance incurred due to their stochastic process of estimation, and improve the convergence rate mainly for convex optimization, while some are extended to non-convex problems [29][30][31]. One of the most practical algorithms for better convergence rates includes momentum [32], modified momentum for accelerated gradient [33], and stochastic estimation of accelerated gradient descent [34]. These algorithms are more focused on the efficiency in convergence than the generalization of models for accuracy.…”

Section: Variance Reductionmentioning

confidence: 99%

Learning-Rate Annealing Methods for Deep Neural Networks

et al. 2021

View full text Add to dashboard Cite

Deep neural networks (DNNs) have achieved great success in the last decades. DNN is optimized using the stochastic gradient descent (SGD) with learning rate annealing that overtakes the adaptive methods in many tasks. However, there is no common choice regarding the scheduled-annealing for SGD. This paper aims to present empirical analysis of learning rate annealing based on the experimental results using the major data-sets on the image classification that is one of the key applications of the DNNs. Our experiment involves recent deep neural network models in combination with a variety of learning rate annealing methods. We also propose an annealing combining the sigmoid function with warmup that is shown to overtake both the adaptive methods and the other existing schedules in accuracy in most cases with DNNs.

show abstract

“…The inclusion of Nesterov's momentum in deterministic step provides a potential of stabilization for stochastic variance-reduced algorithms. However, the acceleration gained with Nesterov's momentum in stochastic steps is mainly the by-product of mini-batching [28]. This double acceleration in DASVRDA [24] is possibly limited to the setting of mini-batching, which prevents us to develop a stochastic algorithm in the non-mini-batch setting.…”

Section: B Momentum Accelerationmentioning

confidence: 99%

“…The popularity and wide applicability of momentum methods in stochastic setting are that they mimic their deterministic gradient counterparts, which result in some practical gains. However, these gains in stochastic cases are particularly the by-products of mini-batching [28]. For instance, Nesterov's momentum is applied in the pure stochastic setting with a batch size of 1 in Acc-Prox-SVRG [18], wherein no acceleration can be guaranteed in theory.…”

Section: A Double Accelerationmentioning

confidence: 99%

Stochastic Momentum Method With Double Acceleration for Regularized Empirical Risk Minimization

2019

View full text Add to dashboard Cite

Momentum acceleration technique is famously known for building gradient-based algorithms with fast convergence in large-scale optimization. Recently, Nesterov's momentum and Katyusha momentum have significantly improved the convergence for stochastic optimization problems. However, the practical gain of acceleration with Nesterov's momentum is mainly a by-product of mini-batching, while acceleration merely with Katyusha momentum in stochastic steps would make the optimization unstable. In this paper, we build a stochastic and doubly accelerated momentum method (SDAMM) which incorporates the Nesterov's momentum and Katyusha momentum in the framework of variance reduction, to stabilize the accelerated algorithm and reduce the dependence on the mini-batching. Theoretically, SDAMM achieves the best-known convergence rates for convex objectives. The experimental results demonstrate that our SDAMM is competitive with state-of-the-art methods for the optimization problems in machine learning.

show abstract

On the Insufficiency of Existing Momentum Schemes for Stochastic Optimization

Cited by 65 publications

References 20 publications

Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models

Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models

Learning-Rate Annealing Methods for Deep Neural Networks

Stochastic Momentum Method With Double Acceleration for Regularized Empirical Risk Minimization

Contact Info

Product

Resources

About