2020
DOI: 10.48550/arxiv.2002.11803
Preprint

Disentangling Adaptive Gradient Methods from Learning Rates

Abstract: We investigate several confounding factors in the evaluation of optimization algorithms for deep learning. Primarily, we take a deeper look at how adaptive gradient methods interact with the learning rate schedule, a notoriously difficult-to-tune hyperparameter which has dramatic effects on the convergence and generalization of neural network training. We introduce a "grafting" experiment which decouples an update's magnitude from its direction, finding that many existing beliefs in the literature may have arisen…
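The grafting operation described in the abstract is simple to state: take the length of one optimizer's proposed step and the direction of another's. The sketch below is a minimal illustration, not the paper's implementation; the function name, the toy vectors, and the flat-vector (rather than per-layer) treatment are my own simplifications.

```python
import numpy as np

def graft_step(step_m, step_d, eps=1e-16):
    """Return a step with the magnitude (norm) of step_m and the
    direction of step_d. eps guards against a zero-length step_d."""
    return (np.linalg.norm(step_m) / (np.linalg.norm(step_d) + eps)) * step_d

# Toy example: a large SGD-like step sets the scale,
# a small Adam-like step sets the direction.
step_sgd = np.array([0.3, -0.4])    # norm 0.5
step_adam = np.array([0.01, 0.01])
grafted = graft_step(step_sgd, step_adam)
print(np.linalg.norm(grafted))              # ~0.5, i.e. the SGD step's magnitude
print(grafted / np.linalg.norm(grafted))    # unit vector along the Adam step
```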



Cited by 6 publications (6 citation statements)
References 31 publications
“…Indeed, rigorously proving causality for any such claim is extremely difficult, even in the hard sciences. Note that there are papers claiming that adaptive updates have worse generalization (Wilson et al., 2017); however, such claims have recently been partly confuted (see, e.g., Agarwal et al., 2020). On this note, despite its empirical success, we stress that Adam will not even converge on some convex functions (Reddi et al., 2018), thus making it hard to prove formal theoretical convergence and/or generalization guarantees.…”
Section: Discussion
confidence: 85%
“…It has already been utilized to explain the success of AdaGrad (Orabona and Pál, 2018). Recently, Agarwal et al. (2020) also provided theoretical and empirical support for setting the ε in the denominator of AdaGrad to 0, thus making the update scale-free.…”
Section: AdamW Is Scale-free We Have Discussed What Advantages the Pr...
confidence: 99%
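As a rough illustration of the scale-free claim in this statement (a sketch under my own assumptions, namely a diagonal AdaGrad update with ε added outside the square root and illustrative toy gradients): when ε = 0, rescaling every gradient by a constant leaves the update unchanged, while any ε > 0 breaks that invariance once gradients are small relative to ε.

```python
import numpy as np

def adagrad_updates(grads, lr=0.1, eps=0.0):
    """Diagonal AdaGrad sketch: divide each step by the square root of the
    running sum of squared gradients, plus eps. Returns the per-step updates."""
    accum = np.zeros_like(grads[0])
    updates = []
    for g in grads:
        accum = accum + g ** 2
        updates.append(-lr * g / (np.sqrt(accum) + eps))
    return updates

grads = [np.array([0.5, -2.0]), np.array([1.0, 0.3])]
tiny = [1e-6 * g for g in grads]   # rescale the loss (and hence all gradients)

# eps = 0: the update is invariant to the rescaling (scale-free).
print(np.allclose(adagrad_updates(grads, eps=0.0),
                  adagrad_updates(tiny, eps=0.0)))   # True

# eps > 0: the rescaled gradients produce different updates.
print(np.allclose(adagrad_updates(grads, eps=1e-8),
                  adagrad_updates(tiny, eps=1e-8)))  # False
```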
“…A multitude of rigorous analyses of AdaGrad, Adam and other adaptive methods have appeared in recent literature, notably [39,26,11]. However, fully understanding the theory and utility of adaptive methods remains an active research area, with diverse (and sometimes clashing) philosophies [40,32,2].…”
Section: Related Work
confidence: 99%
“…To investigate further why Adam is responsible for breaking monotonicity, we follow the "grafting" experiment described in Agarwal et al. (2020), where two optimizers are combined by using the step magnitude from the first and the step direction from the second. Results where we use the SGD step magnitude (which varies with LR) and the Adam direction are shown in 7.…”
Section: C4 Optimizer Ablations
confidence: 99%
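For concreteness, here is a hedged single-tensor sketch of the SGD-magnitude / Adam-direction grafting described in this statement. It is not the cited paper's code: the class name, the default hyperparameters, the whole-tensor (rather than per-layer) norm matching, and the toy quadratic at the end are assumptions for illustration.

```python
import numpy as np

class GraftedSgdAdam:
    """Grafting sketch: SGD supplies the step magnitude (so the learning-rate
    schedule acts through it), Adam supplies the step direction."""

    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m = None
        self.v = None
        self.t = 0

    def step(self, param, grad):
        if self.m is None:
            self.m = np.zeros_like(param)
            self.v = np.zeros_like(param)
        self.t += 1

        # Magnitude optimizer: plain SGD step, which scales directly with lr.
        sgd_step = -self.lr * grad

        # Direction optimizer: Adam step with bias-corrected moment estimates.
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        adam_step = -self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

        # Graft: keep Adam's direction, rescale it to the SGD step's length.
        scale = np.linalg.norm(sgd_step) / (np.linalg.norm(adam_step) + 1e-16)
        return param + scale * adam_step

# Toy usage on the quadratic loss 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([1.0, -3.0])
opt = GraftedSgdAdam(lr=0.1)
for _ in range(5):
    x = opt.step(x, grad=x)
print(x)
```

In this sketch the learning rate enters the grafted step only through the SGD magnitude, while the direction is Adam's regardless of the LR, which matches the ablation's intent of letting the SGD step size carry the schedule.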