2021
DOI: 10.1609/aaai.v35i8.16837
Understanding Decoupled and Early Weight Decay

Abstract: Weight decay (WD) is a traditional regularization technique in deep learning, but despite its ubiquity, its behavior is still an area of active research. Golatkar et al. have recently shown that, in computer vision, WD only matters at the start of training, upending traditional wisdom. Loshchilov et al. show that for adaptive optimizers, manually decaying weights can outperform adding an l2 penalty to the loss. This technique has become increasingly popular and is referred to as decoupled WD. The goal of this…
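To make the distinction the abstract draws concrete, below is a minimal numpy sketch contrasting an l2 penalty with decoupled weight decay in an Adam-style update. It is an illustration of the general idea, not the paper's code; the variable names, hyper-parameter defaults, and exact update form are assumptions for the example.

import numpy as np

def adam_step_l2(w, g, m, v, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # l2 penalty: wd * w is folded into the gradient, so the regularization
    # term is rescaled by the adaptive denominator along with the loss gradient.
    g = g + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adam_step_decoupled(w, g, m, v, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # Decoupled WD: the weights are shrunk directly, independently of the
    # adaptive update computed from the loss gradient alone.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

With plain SGD the two variants coincide up to a rescaling of wd; the difference matters for adaptive optimizers, where the l2 gradient term is also divided by the adaptive denominator while the decoupled term is not.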

Cited by 36 publications (39 citation statements). References 22 publications.

Citation statements:
“…This idea of allowing a hyper-parameter value to change during training (in place of a learning rate schedule) has been extended to other hyper-parameters, such as weight decay [13][14][15][16] and batch sizes [17]. Previous work [20] has demonstrated empirically a relationship between the optimal hyper-parameters of learning rate (LR), weight decay (WD), batch size (BS), and momentum (m) as…”
Section: Hyper-parameters and Regularization
confidence: 99%
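As a concrete illustration of scheduling weight decay itself during training (and of the "early" weight decay in the paper's title), a minimal sketch follows. The schedule shape, cutoff fraction, and function name are illustrative assumptions, not values taken from the cited works.

def early_weight_decay(step, total_steps, wd_max=1e-2, active_fraction=0.3):
    # Illustrative schedule: apply weight decay only during an early
    # fraction of training, then switch it off for the remaining steps.
    return wd_max if step < active_fraction * total_steps else 0.0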
“…General cyclical training provides an intuitive understanding of the value of a one-cycle training regime. Furthermore, this idea of allowing a hyper-parameter value to change during training has been extended to other hyper-parameters, such as weight decay [13][14][15][16] and batch sizes [17].…”
Section: Introduction
confidence: 99%
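In the same spirit, a cyclical (one-cycle-style) schedule for weight decay could look like the triangular sketch below; the shape and endpoint values are illustrative assumptions only.

def one_cycle_weight_decay(step, total_steps, wd_min=1e-4, wd_max=1e-2):
    # Illustrative triangular schedule: ramp weight decay from wd_min up to
    # wd_max at mid-training, then back down to wd_min by the final step.
    half = total_steps / 2
    phase = step / half if step < half else (total_steps - step) / half
    return wd_min + (wd_max - wd_min) * phase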
“…Additionally, different layers are added to prevent the problem of overfitting, such as dropout layers and a normalization layer that keeps the output mean close to 0 and the standard deviation close to 1. This layer hence accelerates training [13].…”
Section: Fully Connected Layer (FC)
confidence: 99%
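A minimal PyTorch-style sketch of such a fully connected head is shown below; the layer sizes and dropout rate are hypothetical, and this is not the cited model's architecture.

import torch.nn as nn

# Illustrative fully connected head (hypothetical sizes): batch normalization
# keeps each layer's output near zero mean and unit standard deviation per
# mini-batch, and dropout randomly zeroes activations to counteract overfitting.
fc_head = nn.Sequential(
    nn.Linear(512, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 10),
)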
“…Batch normalization is a technique to standardize activations in intermediate layers of deep neural networks across mini-batches. It has demonstrated improved accuracy and faster convergence due to its stabilization of the learning process [30]. Additionally, introducing batch normalization allows the inputs and outputs of the regression model to remain unscaled, thus retaining the hierarchical structure of the coherency-loss function.…”
Section: Regressor Design
confidence: 99%
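For reference, the per-mini-batch standardization that batch normalization performs on a batch of activations x_1, ..., x_m (with learned scale gamma, learned shift beta, and a small constant epsilon for numerical stability) is:

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta.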