2022
DOI: 10.48550/arxiv.2202.00089
Preprint
Understanding AdamW through Proximal Methods and Scale-Freeness

Abstract: Adam has been widely adopted for training deep neural networks due to less hyperparameter tuning and remarkable performance. To improve generalization, Adam is typically used in tandem with a squared ℓ₂ regularizer (referred to as Adam-ℓ₂). However, even better performance can be obtained with AdamW, which decouples the gradient of the regularizer from the update rule of Adam-ℓ₂. Yet, we are still lacking a complete explanation of the advantages of AdamW. In this paper, we tackle this question from both an opti…
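To make the decoupling described in the abstract concrete, here is a minimal NumPy sketch contrasting the two update rules on a single parameter vector; the variable names, default hyperparameters, and the exact placement of the decay term are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, lam=1e-2):
    """One step of Adam with a squared-l2 penalty folded into the gradient
    (Adam-l2): the regularizer's gradient lam*w passes through the adaptive
    preconditioner along with the loss gradient."""
    g = grad + lam * w                       # coupled: penalty added to the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)               # bias-corrected first moment
    v_hat = v / (1 - beta2**t)               # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=1e-2):
    """One step of AdamW: the decay lam*w is applied directly to the weights
    and never enters the moment estimates."""
    g = grad                                 # decoupled: moments see only the loss gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + lam * w)
    return w, m, v
```

The only difference between the two functions is where λw enters: inside the moment estimates (Adam-ℓ₂) or directly in the weight update (AdamW).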

Cited by 12 publications (13 citation statements)
References 26 publications
“…Decay Weight by Proximation. As observed in AdamW [3,57], decoupling the optimization objective from simple-type regularization (e.g., the ℓ₂ regularizer) can largely improve the generalization performance.…”
Section: Adaptive Nesterov Momentum Algorithm (mentioning)
confidence: 98%
“…However, this method heavily relies on the choice of optimization algorithm [13], as there is a risk of getting stuck in local minima during gradient computation, leading to the model's inability to learn and to degraded prediction/recognition quality (vanishing gradients). To address this, we employed the AdamW [14] optimizer, one of the state-of-the-art methods, which leverages the learning-rate history to approximate the direction of the anti-gradient while incorporating momentum to expedite convergence. This optimizer significantly improves model training; however, it is sensitive to the choice of the learning rate.…”
Section: Transformer Models (mentioning)
confidence: 99%
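A minimal usage sketch of the optimizer this passage refers to, assuming PyTorch's built-in torch.optim.AdamW; the model, learning rate, weight decay, and schedule are placeholders rather than the cited work's configuration.

```python
import torch
from torch import nn

# Placeholder Transformer-style model; the cited work's architecture is not
# reproduced here.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
)

# AdamW applies weight decay directly to the parameters, decoupled from the
# adaptive gradient step. It remains sensitive to the learning rate, which is
# why it is commonly paired with a schedule.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

def train_step(batch, targets, loss_fn=nn.MSELoss()):
    """One optimization step on a (batch, seq, d_model) input tensor."""
    optimizer.zero_grad()
    out = model(batch)
    loss = loss_fn(out, targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```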
“…This article makes some fine-tuning adjustments to the BERT [4] model, setting the learning rate to 1 × 10⁻⁵, the batch size to 12, and the number of epochs to 10; the optimizer uses AdamW [7,8], and the data is run in data parallel to accelerate the model.…”
Section: Implementation Details (mentioning)
confidence: 99%
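A hedged sketch of the fine-tuning setup described in this statement (learning rate 1 × 10⁻⁵, batch size 12, 10 epochs, AdamW, data parallelism), assuming the Hugging Face transformers BERT implementation and PyTorch's DataParallel; the checkpoint name, dataset, and task head are assumptions, not details from the cited work.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification

# Hyperparameters reported in the citing paper; everything else here
# (checkpoint, dataset, task head) is an illustrative assumption.
LR, BATCH_SIZE, EPOCHS = 1e-5, 12, 10

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model = torch.nn.DataParallel(model).cuda()   # replicate across available GPUs

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

def fine_tune(dataset):
    """dataset is assumed to yield dicts with input_ids, attention_mask, labels."""
    loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
    model.train()
    for _ in range(EPOCHS):
        for batch in loader:
            batch = {k: v.cuda() for k, v in batch.items()}
            optimizer.zero_grad()
            out = model(**batch)
            loss = out.loss.mean()   # .mean() because DataParallel gathers per-GPU losses
            loss.backward()
            optimizer.step()
```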