2021
DOI: 10.1007/s11222-021-10016-8
Analysis of stochastic gradient descent in continuous time

Abstract: Stochastic gradient descent is an optimisation method that combines classical gradient descent with random subsampling within the target functional. In this work, we introduce the stochastic gradient process as a continuous-time representation of stochastic gradient descent. The stochastic gradient process is a dynamical system that is coupled with a continuous-time Markov process living on a finite state space. The dynamical system—a gradient flow—represents the gradient descent part; the process on the finite state space represents the random subsampling. […]
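
To illustrate the construction described in the abstract, here is a minimal simulation sketch; the toy least-squares potentials, the uniform index process, and all names are illustrative assumptions, not taken from the paper. Between jumps the state follows the gradient flow of a single subsampled potential (integrated here with a plain Euler step); at exponential waiting times, i.e. at a Poisson rate, the index defining the potential is resampled, playing the role of the random subsampling in discrete-time SGD.

import numpy as np

rng = np.random.default_rng(0)

# Toy subsampled potentials f_i(x) = 0.5 * (a_i * x - b_i)**2, i = 0..9 (assumed example).
a = rng.normal(size=10)
b = rng.normal(size=10)

def grad_f(i, x):
    """Gradient of the i-th subsampled potential."""
    return a[i] * (a[i] * x - b[i])

def stochastic_gradient_process(x0, rate=5.0, T=10.0, dt=1e-3):
    """Euler discretisation of dx/dt = -grad f_{i(t)}(x), where the index
    process i(t) jumps to a uniformly resampled value at Poisson rate `rate`."""
    x, t = x0, 0.0
    i = rng.integers(len(a))                 # currently selected data index
    next_jump = rng.exponential(1.0 / rate)  # first exponential waiting time
    while t < T:
        x -= dt * grad_f(i, x)               # gradient-flow step on current potential
        t += dt
        if t >= next_jump:                   # regenerate the subsample
            i = rng.integers(len(a))
            next_jump += rng.exponential(1.0 / rate)
    return x

x_end = stochastic_gradient_process(x0=0.0)
x_star = np.sum(a * b) / np.sum(a * a)       # minimiser of the full-sample objective
print(x_end, x_star)                         # x_end fluctuates around x_star

For a moderate jump rate the path hovers near the minimiser of the full objective, which is the qualitative behaviour the continuous-time representation is meant to capture.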

Cited by 22 publications (25 citation statements)
References 53 publications
“…In a setting where the differential inclusions are actually differential equations and sufficiently smooth, one can sometimes show that (x_λ(t))_{t≥0} → (x(t))_{t≥0} in a weak sense, as λ → 0. We refer to [20, 23] for results of this type and a general perspective on stochastic approximation in continuous time.…”
Section: Problem Setting and Motivation
Citation type: mentioning (confidence: 99%)
“…see [23, Lemma 5] for details. The infinitesimal generator is the transition rate matrix that we give in Subsection 1.1; it has domain…”
Section: Feller Processes and Their Generators
Citation type: mentioning (confidence: 99%)
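
For reference, on a finite state space the infinitesimal generator and the transition rate matrix coincide in the following standard sense; this is a sketch of the usual definition, with the notation Q = (q_{jk}) assumed here rather than quoted from either paper:

% Generator A of a continuous-time Markov process (i(t))_{t >= 0} on the
% finite state space I = {1, ..., N} with transition rate matrix Q = (q_{jk}):
(A \varphi)(j) = \sum_{k \in I} q_{jk} \, \varphi(k)
               = \sum_{k \neq j} q_{jk} \bigl( \varphi(k) - \varphi(j) \bigr),
\qquad \varphi : I \to \mathbb{R},
% where the second equality uses that each row of Q sums to zero,
% i.e. q_{jj} = - \sum_{k \neq j} q_{jk}.

Since every real-valued function on a finite state space is bounded, the generator is simply the matrix Q acting on all of \mathbb{R}^N.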
“…We assume that we do not have direct access to the gradient ∇f(x) but to a random estimate ∇f(x, ξ), where ξ ∈ Ξ is random of law P. In the continuized framework, the randomness of the stochastic gradient and its time mix in a particularly convenient way. For similar reasons, Latz studied stochastic gradient descent as a gradient flow on a random function that is regenerated at a Poisson rate Latz [2021]. However, this approach has the same shortcomings as the other approaches based on gradient flows: the subsequent discretization introduces non-trivial errors.…”
Section: Continuized Nesterov Acceleration of Stochastic Gradient Descent
Citation type: mentioning (confidence: 99%)
“…Another set of related literature is on the diffusion approximation of SGD (Li, Tai and Weinan, 2017; Feng, Li and Liu, 2017; Yang, Hu and Li, 2021; Sirignano and Spiliopoulos, 2020; Latz, 2021). The authors aim to approximate the trajectory of SGD by a diffusion process which solves an SDE.…”
Citation type: mentioning (confidence: 99%)
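
The diffusion approximation referenced in this statement replaces the discrete SGD recursion x_{k+1} = x_k − η ∇f(x_k, ξ_k) by a stochastic differential equation, typically of the form dX_t = −∇f(X_t) dt + √η Σ(X_t)^{1/2} dW_t, with time identified as t = kη. A minimal one-dimensional sketch follows; the quadratic objective and the constant noise level σ are assumptions for illustration, not taken from the cited works:

import numpy as np

rng = np.random.default_rng(1)

def grad(x):          # f(x) = x**2 / 2, so the full gradient is x
    return x

eta = 0.1             # learning rate; also the SDE time step dt = eta
sigma = 0.5           # assumed constant noise level of the gradient estimate
n_steps = 200

# Discrete SGD: x_{k+1} = x_k - eta * (grad(x_k) + sigma * Z_k)
x_sgd = 2.0
for _ in range(n_steps):
    x_sgd -= eta * (grad(x_sgd) + sigma * rng.normal())

# Euler-Maruyama for the approximating SDE:
# dX_t = -grad(X_t) dt + sqrt(eta) * sigma dW_t, with dt = eta
x_sde, dt = 2.0, eta
for _ in range(n_steps):
    x_sde += -grad(x_sde) * dt + np.sqrt(eta) * sigma * np.sqrt(dt) * rng.normal()

print(x_sgd, x_sde)   # both fluctuate around the minimiser x = 0

With dt = η, the Euler–Maruyama increment −∇f(x) η + √η σ √η Z matches the mean and variance of the SGD increment −η (∇f(x) + σZ), which is the sense in which the diffusion process tracks the SGD trajectory.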