2021
DOI: 10.48550/arxiv.2106.15739
Preprint

On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay

Abstract: Despite the conventional wisdom that using batch normalization with weight decay may improve neural network training, some recent works show their joint usage may cause instabilities at the late stages of training. Other works, in contrast, show convergence to the equilibrium, i.e., the stabilization of training metrics. In this paper, we study this contradiction and show that instead of converging to a stable equilibrium, the training dynamics converge to consistent periodic behavior. That is, the training pr…
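As a concrete illustration of the phenomenon the abstract describes, the following is a minimal, hypothetical PyTorch sketch (not the authors' code; the data, architecture, and hyperparameters are assumptions) that trains a batch-normalized network with SGD and weight decay while logging the norm of the scale-invariant weights, the quantity one would inspect for periodic destabilization-and-recovery cycles rather than convergence to a fixed point.

# Hypothetical sketch: observe BN + weight decay dynamics on toy data.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: random inputs and binary labels (assumption; the paper uses real datasets).
X = torch.randn(512, 16)
y = torch.randint(0, 2, (512,))

# The first linear layer is followed by BatchNorm, making its weights
# scale-invariant -- the regime where BN and weight decay interact.
model = nn.Sequential(
    nn.Linear(16, 32, bias=False),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 2),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()

norms = []
for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    # Track the norm of the scale-invariant weights; per the abstract, this
    # trace can settle into repeated periods of instability and recovery
    # instead of stabilizing at an equilibrium value.
    norms.append(model[0].weight.norm().item())

Plotting the recorded norms over training steps is one way to check whether the trace is periodic rather than converging.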

Cited by 2 publications (3 citation statements)
References 9 publications
“…However, we should be careful that this analysis does not apply to larger models that do not overfit the data. Second, this shows that the SDE modeling in [Li et al., 2020, Lobacheva et al., 2021] can also be valid. It also shows that our work studies a problem of a different nature (non-zero grad norm).…”
Section: Cifar10 Experiments
Confidence: 75%
“…These observations motivate us to rethink the convergence proofs used in classical optimization analysis. In addition, a few very recent results reported similarly large oscillations in Cifar10 training [Li et al., 2020, Kunin et al., 2021, Lobacheva et al., 2021], though the authors focus on SDE approximation or batch normalization. Our work instead focuses on the connection to nonconvex optimization theorems.…”
Section: Related Work
Confidence: 99%