Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation

Feng, Yange; Gao, Tingran; Li, Lei; Liu, Jianguo; Lu, Yulong

doi:10.4310/cms.2020.v18.n1.a7

Cited by 6 publications

(4 citation statements)

References 27 publications

“…If F = 0 we obtain an equation in the original time with a higher order term, which is reminiscent of the stochastic modified equations (cf. [4,8])…”

Section: Hilbert Expansion and Stochastic Modified Equations (mentioning)

confidence: 99%

Analysis of Kinetic Models for Label Switching and Stochastic Gradient Descent

Burger, Rossi

2022

Preprint

In this paper we provide a novel approach to the analysis of kinetic models for label switching, which are used for particle systems that can randomly switch between gradient flows in different energy landscapes. Besides problems in biology and physics, we also demonstrate that stochastic gradient descent, the most popular technique in machine learning, can be understood in this setting, when considering a time-continuous variant. Our analysis is focusing on the case of evolution in a collection of external potentials, for which we provide analytical and numerical results about the evolution as well as the stationary problem.

“…However, implicit regularization and backward error analysis has not been explored. Backward error analysis was used [54,55] to study stochastic gradient descent in the context of stochastic differential equations and diffusion equations for the study of convergence and adaptive learning schemes, but, to the best of our knowledge, it has not been used to explore implicit regularization in gradient descent.…”

Section: Related Work (mentioning)

confidence: 99%

Implicit Gradient Regularization

Barrett, Dherin

2020

Preprint

Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test errors are small and solutions are robust to noisy parameter perturbations. Furthermore, we demonstrate that the implicit gradient regularization term can be used as an explicit regularizer, allowing us to control this gradient regularization directly. More broadly, our work indicates that backward error analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter regularization interact to determine the properties of overparameterized models optimized with gradient descent.

“…Our motivations, contributions and methods. It was recently discovered in [44,43,29,31,55,36,27,60,30,14] that SGD algorithms can be (weakly) approximated by continuous time SDEs. These SDEs often offer much needed insight to the algorithms under considerations, for instance, the continuous time treatment allows applications of stochastic control theory to develop novel adaptive algorithms [64,66].…”

Section: Introduction (mentioning)

confidence: 99%