On the linear convergence of the stochastic gradient method with constant step-size

Cevher, Volkan; Vũ, Bằng Công

doi:10.1007/s11590-018-1331-1

Cited by 18 publications

(12 citation statements)

References 12 publications

“…Moreover, in practice VR methods do not tend to converge faster than SGD on over-parameterized models [19]. Indeed, recent works [83,52,5,47,13,33,73] have shown that when training over-parameterized models, classic SGD with a constant step-size and without VR can achieve the convergence rates of full-batch gradient descent. These works assume that the model is expressive enough to interpolate the data.…”

mentioning

confidence: 99%

Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates

Vaswani, Mishkin, Laradji et al. 2019

Preprint

Recent works have shown that stochastic gradient descent (SGD) achieves the fast convergence rates of full-batch gradient descent for over-parameterized models satisfying certain interpolation conditions. However, the step-size used in these works depends on unknown quantities and SGD's practical performance heavily relies on the choice of this step-size. We propose to use line-search techniques to automatically set the step-size when training models that can interpolate the data. In the interpolation setting, we prove that SGD with a stochastic variant of the classic Armijo line-search attains the deterministic convergence rates for both convex and strongly-convex functions. Under additional assumptions, SGD with Armijo line-search is shown to achieve fast convergence for non-convex functions. Furthermore, we show that stochastic extra-gradient with a Lipschitz line-search attains linear convergence for an important class of non-convex functions and saddle-point problems satisfying interpolation. To improve the proposed methods' practical performance, we give heuristics to use larger step-sizes and acceleration. We compare the proposed algorithms against numerous optimization methods on standard classification tasks using both kernel methods and deep networks. The proposed methods result in competitive performance across all models and datasets, while being robust to the precise choices of hyper-parameters. For multi-class classification using deep networks, SGD with Armijo line-search results in both faster convergence and better generalization.

Introduction

Stochastic gradient descent (SGD) and its variants [21,87,39,82,72,35,18] are the preferred optimization methods in modern machine learning. They only require the gradient for one training example (or a small "mini-batch" of examples) in each iteration and thus can be used with large datasets. These first-order methods have been particularly successful for training highly-expressive, over-parameterized models such as non-parametric regression [45,7] and deep neural networks [9,88]. However, the practical efficiency of stochastic gradient methods is adversely affected by two challenges: (i) their performance heavily relies on the choice of the step-size ("learning rate") [9,70] and (ii) their slow convergence compared to methods that compute the full gradient (over all training examples) in each iteration [58].
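The abstract above describes SGD with a stochastic Armijo line-search in the interpolation setting. The following is a minimal sketch of that idea, not the authors' implementation. It assumes an over-parameterized least-squares problem (fewer examples than parameters, so the data can be fit exactly). The names `sgd_armijo`, `eta_max`, `c`, and `beta` are illustrative, and resetting the trial step-size to `eta_max` at every iteration is a simplification.

```python
import numpy as np

def sgd_armijo(A, b, steps=5000, eta_max=10.0, c=0.5, beta=0.5, seed=0):
    """SGD on f_i(w) = 0.5 * (a_i @ w - b_i)**2 with backtracking Armijo search.

    Hypothetical helper: the step-size is shrunk by `beta` until the sampled
    loss satisfies f_i(w - eta * g) <= f_i(w) - c * eta * ||g||^2.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)              # sample one training example
        a_i, b_i = A[i], b[i]
        r = a_i @ w - b_i                # residual on the sampled example
        f_i = 0.5 * r * r                # sampled loss
        g = r * a_i                      # sampled gradient
        g_sq = g @ g
        if g_sq == 0.0:                  # example already fit exactly
            continue
        eta = eta_max
        # Backtrack until the Armijo condition holds on the sampled loss.
        while 0.5 * (a_i @ (w - eta * g) - b_i) ** 2 > f_i - c * eta * g_sq:
            eta *= beta
        w = w - eta * g
    return w

# Synthetic interpolating problem: 20 examples, 50 parameters.
data_rng = np.random.default_rng(1)
A = data_rng.standard_normal((20, 50))
b = A @ data_rng.standard_normal(50)

initial_loss = 0.5 * np.mean(b ** 2)     # loss at w = 0
w = sgd_armijo(A, b)
final_loss = 0.5 * np.mean((A @ w - b) ** 2)
print(initial_loss, final_loss)
```

No step-size is tuned by hand here: the backtracking loop picks one per iteration from the sampled loss alone, which is the property the abstract emphasizes.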

“…for M = 0 we recover the uniformly bounded noise assumption. Furthermore, it has been proved that this property always holds under certain assumptions (Cevher and Vũ, 2019).…”

Section: Stochastic Noise

mentioning

confidence: 96%

Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates

Stich, Mohtashami, Jäggi 2021

Preprint

It has been experimentally observed that the efficiency of distributed training with stochastic gradient (SGD) depends decisively on the batch size and, in asynchronous implementations, on the gradient staleness. Especially, it has been observed that the speedup saturates beyond a certain batch size and/or when the delays grow too large. We identify a data-dependent parameter that explains the speedup saturation in both these settings. Our comprehensive theoretical analysis, for strongly convex, convex and non-convex settings, unifies and generalizes prior work directions that often focused on only one of these two aspects. In particular, our approach allows us to derive improved speedup results under frequently considered sparsity assumptions. Our insights give rise to theoretically based guidelines on how the learning rates can be adjusted in practice. We show that our results are tight and illustrate key findings in numerical experiments.

“…Vaswani et al (2019) propose to use line-search to set the step-size while training over-parameterized models which can fit completely to data. Several other works propose to use constant learning rate for stochastic gradient methods (Ma et al, 2017; Bassily et al, 2018; Liu & Belkin, 2018; Cevher & Vũ, 2019) while training extremely expressive models which interpolate. However, all of the above mentioned works are primal-based algorithms.…”

Section: Related Work

mentioning

confidence: 99%

Explicit Regularization of Stochastic Gradient Methods through Duality

Raj, Bach 2020

Preprint

We consider stochastic gradient methods under the interpolation regime where a perfect fit can be obtained (minimum loss at each observation). While previous work highlighted the implicit regularization of such algorithms, we consider an explicit regularization framework as a minimum Bregman divergence convex feasibility problem. Using convex duality, we propose randomized Dykstra-style algorithms based on randomized dual coordinate ascent. For non-accelerated coordinate descent, we obtain an algorithm which bears strong similarities with (non-averaged) stochastic mirror descent on specific functions, as it is equivalent for quadratic objectives, and equivalent in the early iterations for more general objectives. It comes with the benefit of an explicit convergence theorem to a minimum norm solution. For accelerated coordinate descent, we obtain a new algorithm that has better convergence properties than existing stochastic gradient methods in the interpolating regime. This leads to accelerated versions of the perceptron for generic p-norm regularizers, which we illustrate in experiments. * The work was done when Anant Raj was visiting Inria.
