2021
DOI: 10.48550/arxiv.2106.15739
Preprint

On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay

Abstract: Despite the conventional wisdom that using batch normalization with weight decay may improve neural network training, some recent works show their joint usage may cause instabilities at the late stages of training. Other works, in contrast, show convergence to the equilibrium, i.e., the stabilization of training metrics. In this paper, we study this contradiction and show that instead of converging to a stable equilibrium, the training dynamics converge to consistent periodic behavior. That is, the training pr…
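As a concrete illustration of the phenomenon the abstract describes, the following is a minimal, hypothetical PyTorch sketch (not the authors' code; the data, architecture, and hyperparameters are assumptions) that trains a batch-normalized network with SGD and weight decay while logging the norm of the scale-invariant weights, the quantity one would inspect for periodic destabilization-and-recovery cycles rather than convergence to a fixed point.

# Hypothetical sketch: observe BN + weight decay dynamics on toy data.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: random inputs and binary labels (assumption; the paper uses real datasets).
X = torch.randn(512, 16)
y = torch.randint(0, 2, (512,))

# The first linear layer is followed by BatchNorm, making its weights
# scale-invariant -- the regime where BN and weight decay interact.
model = nn.Sequential(
    nn.Linear(16, 32, bias=False),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 2),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()

norms = []
for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    # Track the norm of the scale-invariant weights; per the abstract, this
    # trace can settle into repeated periods of instability and recovery
    # instead of stabilizing at an equilibrium value.
    norms.append(model[0].weight.norm().item())

Plotting the recorded norms over training steps is one way to check whether the trace is periodic rather than converging.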

Cited by 2 publications (3 citation statements)
References 9 publications
“…However, we should be careful that this analysis does not apply to larger models that do not overfit the data. Second, this shows that the SDE modeling in [Li et al., 2020, Lobacheva et al., 2021] can also be valid. It also shows that our work studies a problem of a different nature (non-zero grad norm).…”
Section: Cifar10 Experiments
Confidence: 75%
“…These observations motivate us to rethink the convergence proofs used in classical optimization analysis. In addition, a few very recent results reported similarly large oscillations in Cifar10 training [Li et al., 2020, Kunin et al., 2021, Lobacheva et al., 2021], though the authors focus on SDE approximation or batch normalization. Our work instead focuses on the connection to nonconvex optimization theorems.…”
Section: Related Work
Confidence: 99%