2020
DOI: 10.48550/arxiv.2002.11803
Preprint

Disentangling Adaptive Gradient Methods from Learning Rates

Abstract: We investigate several confounding factors in the evaluation of optimization algorithms for deep learning. Primarily, we take a deeper look at how adaptive gradient methods interact with the learning rate schedule, a notoriously difficult-to-tune hyperparameter which has dramatic effects on the convergence and generalization of neural network training. We introduce a "grafting" experiment which decouples an update's magnitude from its direction, finding that many existing beliefs in the literature may have arisen…
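The grafting operation described in the abstract is simple to state: take the length of one optimizer's proposed step and the direction of another's. The sketch below is a minimal illustration, not the paper's implementation; the function name, the toy vectors, and the flat-vector (rather than per-layer) treatment are my own simplifications.

```python
import numpy as np

def graft_step(step_m, step_d, eps=1e-16):
    """Return a step with the magnitude (norm) of step_m and the
    direction of step_d. eps guards against a zero-length step_d."""
    return (np.linalg.norm(step_m) / (np.linalg.norm(step_d) + eps)) * step_d

# Toy example: a large SGD-like step sets the scale,
# a small Adam-like step sets the direction.
step_sgd = np.array([0.3, -0.4])    # norm 0.5
step_adam = np.array([0.01, 0.01])
grafted = graft_step(step_sgd, step_adam)
print(np.linalg.norm(grafted))              # ~0.5, i.e. the SGD step's magnitude
print(grafted / np.linalg.norm(grafted))    # unit vector along the Adam step
```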



Cited by 6 publications (6 citation statements)
References 31 publications
“…Indeed, rigorously proving causality for any such claim is extremely difficult, even in the hard sciences. Note that there are papers claiming that adaptive updates have worse generalization (Wilson et al., 2017); however, such claims have recently been partly confuted (see, e.g., Agarwal et al., 2020). On this note, despite its empirical success, we stress that Adam will not even converge on some convex functions (Reddi et al., 2018), thus making it hard to prove formal theoretical convergence and/or generalization guarantees.…”
Section: Discussion
confidence: 85%
“…It has already been utilized to explain the success of AdaGrad (Orabona and Pál, 2018). Recently, Agarwal et al. (2020) also provided theoretical and empirical support for setting the ε in the denominator of AdaGrad to 0, thus making the update scale-free.…”
Section: AdamW Is Scale-free We Have Discussed What Advantages the Pr...
confidence: 99%
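As a rough illustration of the scale-free claim in this statement (a sketch under my own assumptions, namely a diagonal AdaGrad update with ε added outside the square root and illustrative toy gradients): when ε = 0, rescaling every gradient by a constant leaves the update unchanged, while any ε > 0 breaks that invariance once gradients are small relative to ε.

```python
import numpy as np

def adagrad_updates(grads, lr=0.1, eps=0.0):
    """Diagonal AdaGrad sketch: divide each step by the square root of the
    running sum of squared gradients, plus eps. Returns the per-step updates."""
    accum = np.zeros_like(grads[0])
    updates = []
    for g in grads:
        accum = accum + g ** 2
        updates.append(-lr * g / (np.sqrt(accum) + eps))
    return updates

grads = [np.array([0.5, -2.0]), np.array([1.0, 0.3])]
tiny = [1e-6 * g for g in grads]   # rescale the loss (and hence all gradients)

# eps = 0: the update is invariant to the rescaling (scale-free).
print(np.allclose(adagrad_updates(grads, eps=0.0),
                  adagrad_updates(tiny, eps=0.0)))   # True

# eps > 0: the rescaled gradients produce different updates.
print(np.allclose(adagrad_updates(grads, eps=1e-8),
                  adagrad_updates(tiny, eps=1e-8)))  # False
```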
“…A multitude of rigorous analyses of AdaGrad, Adam and other adaptive methods have appeared in recent literature, notably [39,26,11]. However, fully understanding the theory and utility of adaptive methods remains an active research area, with diverse (and sometimes clashing) philosophies [40,32,2].…”
Section: Related Work
confidence: 99%
“…To investigate further why Adam is responsible for breaking monotonicity, we follow the "grafting" experiment described in Agarwal et al. (2020), where two optimizers are combined by using the step magnitude from the first and the step direction from the second. Results where we use the SGD step magnitude (which varies with LR) and the Adam direction are shown in 7.…”
Section: C4 Optimizer Ablations
confidence: 99%
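For concreteness, here is a hedged single-tensor sketch of the SGD-magnitude / Adam-direction grafting described in this statement. It is not the cited paper's code: the class name, the default hyperparameters, the whole-tensor (rather than per-layer) norm matching, and the toy quadratic at the end are assumptions for illustration.

```python
import numpy as np

class GraftedSgdAdam:
    """Grafting sketch: SGD supplies the step magnitude (so the learning-rate
    schedule acts through it), Adam supplies the step direction."""

    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m = None
        self.v = None
        self.t = 0

    def step(self, param, grad):
        if self.m is None:
            self.m = np.zeros_like(param)
            self.v = np.zeros_like(param)
        self.t += 1

        # Magnitude optimizer: plain SGD step, which scales directly with lr.
        sgd_step = -self.lr * grad

        # Direction optimizer: Adam step with bias-corrected moment estimates.
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        adam_step = -self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

        # Graft: keep Adam's direction, rescale it to the SGD step's length.
        scale = np.linalg.norm(sgd_step) / (np.linalg.norm(adam_step) + 1e-16)
        return param + scale * adam_step

# Toy usage on the quadratic loss 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([1.0, -3.0])
opt = GraftedSgdAdam(lr=0.1)
for _ in range(5):
    x = opt.step(x, grad=x)
print(x)
```

In this sketch the learning rate enters the grafted step only through the SGD magnitude, while the direction is Adam's regardless of the LR, which matches the ablation's intent of letting the SGD step size carry the schedule.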