Don't Jump Through Hoops and Remove Those Loops: SVRG and Katyusha are Better Without the Outer Loop

Kovalev, D. Yu.; Horváth, Samuel; Richtárik, Peter

doi:10.48550/arxiv.1901.08689

Cited by 19 publications

(47 citation statements)

References 12 publications

“…Note that those findings are in accord with Corollary H.3. Similar results were shown in [15] for LSVRG. 4.…”

Section: D2 Svrcd: Effect Of ρ

supporting

confidence: 88%

“…Thus, for all j, it does not make sense to increase sampling size beyond point where $p_i^t q_j^t \geq 1/n$ as the convergence speed would not increase significantly [15].…”

Section: Algorithm 16 Isaega [New Method]

mentioning

confidence: 99%

One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods

Hanzely, Richtárik

2019

Preprint

We propose a remarkably general variance-reduced method suitable for solving regularized empirical risk minimization problems with either a large number of training examples, or a large model dimension, or both. In special cases, our method reduces to several known and previously thought to be unrelated methods, such as SAGA [3], LSVRG [12,15], JacSketch [9], SEGA [10] and ISEGA [21], and their arbitrary sampling and proximal generalizations. However, we also highlight a large number of new specific algorithms with interesting properties. We provide a single theorem establishing linear convergence of the method under smoothness and quasi strong convexity assumptions. With this theorem we recover best-known and sometimes improved rates for known methods arising in special cases. As a by-product, we provide the first unified method and theory for stochastic gradient and stochastic coordinate descent type methods.

“…in case ii) one may consider the batch gradient $\nabla f(x^k)$ if n is small, or a variance-reduced gradient estimator, such as SVRG [28,31] or SAGA [17,46], if n is large. Our general analysis allows for any estimator to be used as long as it satisfies a certain technical assumption (Assumption 2).…”

Section: Summary Of Contributions

mentioning

confidence: 99%

A Stochastic Decoupling Method for Minimizing the Sum of Smooth and Non-Smooth Functions

Mishchenko, Richtárik

2019

Preprint

We consider the problem of minimizing the sum of three convex functions: i) a smooth function f in the form of an expectation or a finite average, ii) a non-smooth function g in the form of a finite average of proximable functions $g_j$, and iii) a proximable regularizer R. We design a variance-reduced method which is able to progressively learn the proximal operator of g via the computation of the proximal operator of a single randomly selected function $g_j$ in each iteration only. Our method can provably and efficiently accommodate many strategies for the estimation of the gradient of f, including via standard and variance-reduced stochastic estimation, effectively decoupling the smooth part of the problem from the non-smooth part. We prove a number of iteration complexity results, including a general $O(1/t)$ rate, an $O(1/t^2)$ rate in the case of strongly convex f, and several linear rates in special cases, including an accelerated linear rate. For example, our method achieves a linear rate for the problem of minimizing a strongly convex function f under linear constraints under no assumption on the constraints beyond consistency. When combined with SGD or SAGA estimators for the gradient of f, this leads to a very efficient method for empirical risk minimization with large linear constraints. Our method generalizes several existing algorithms, including forward-backward splitting, Douglas-Rachford splitting, proximal SGD, proximal SAGA, SDCA, randomized Kaczmarz and Point-SAGA. However, our method leads to many new specific methods in special cases; for instance, we obtain the first randomized variant of Dykstra's method for projection onto the intersection of closed convex sets.

The proximal operator of a function R is defined as $\operatorname{prox}_{\eta R}(x) := \operatorname{argmin}_{u \in \mathbb{R}^d} \left\{ R(u) + \frac{1}{2\eta} \|u - x\|^2 \right\}$.
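As an illustration of the proximal operator defined in the abstract above, the following minimal sketch evaluates it in closed form for one common proximable regularizer, R(u) = λ‖u‖₁, where the solution is coordinate-wise soft-thresholding. The choice of R, the function names `prox_l1` and `prox_objective`, and the parameters `eta` and `lam` are assumptions made for this example; they are not taken from the paper.

```python
import math


def prox_l1(x, eta, lam):
    """Evaluate prox_{eta*R}(x) for the assumed regularizer R(u) = lam * ||u||_1.

    The minimizer is soft-thresholding: shrink each coordinate toward zero
    by eta * lam, and set it to zero if its magnitude is below that threshold.
    """
    threshold = eta * lam
    return [math.copysign(max(abs(xi) - threshold, 0.0), xi) for xi in x]


def prox_objective(u, x, eta, lam):
    """The function minimized by the prox: R(u) + ||u - x||^2 / (2 * eta)."""
    reg = lam * sum(abs(ui) for ui in u)
    dist = sum((ui - xi) ** 2 for ui, xi in zip(u, x))
    return reg + dist / (2.0 * eta)


if __name__ == "__main__":
    x = [3.0, -0.5, 1.0]
    u = prox_l1(x, eta=1.0, lam=1.0)
    print("prox:", u, "objective:", prox_objective(u, x, 1.0, 1.0))
```

The helper `prox_objective` is included only so the result can be sanity-checked: perturbing any coordinate of the returned point should not decrease the objective.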

“…Recently, an error compensated method called EC-LSVRG-DIANA which can achieve linear convergence for the strongly convex and smooth case was proposed by Gorbunov et al. [2020], but besides the contraction compressor, the unbiased compressor is also needed in the algorithm. In this paper, we study the error compensated methods for loopless SVRG (L-SVRG) [Kovalev et al., 2019], Quartz [Qu et al., 2015], and SDCA [Shalev-Shwartz and Zhang, 2013], where only contraction compressors are needed.…”

Section: Introduction

mentioning

confidence: 99%
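The loopless SVRG (L-SVRG) method referenced in the statement above drops SVRG's outer loop: at each step the reference point is refreshed with a small probability p instead of on a fixed schedule. A minimal sketch on a toy one-dimensional least-squares problem follows; the function name `lsvrg`, the data, the step size, and the value of p are illustrative assumptions, not values taken from any of the cited papers.

```python
import random


def lsvrg(a, b, eta, p, iters, seed=0):
    """Loopless SVRG sketch for f(x) = (1/n) * sum_i 0.5 * (a[i] * x - b[i])**2.

    x is the iterate, w is the reference point, and gw caches the full
    gradient at w. All names here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    n = len(a)

    def grad_i(i, x):
        # Gradient of the i-th component function at x.
        return a[i] * (a[i] * x - b[i])

    def full_grad(x):
        return sum(grad_i(i, x) for i in range(n)) / n

    x = 0.0
    w = x
    gw = full_grad(w)
    for _ in range(iters):
        i = rng.randrange(n)
        # Variance-reduced gradient estimator (unbiased for the full gradient at x).
        g = grad_i(i, x) - grad_i(i, w) + gw
        x_next = x - eta * g
        # Coin flip replaces the outer loop: refresh the reference point
        # to the current iterate with probability p.
        if rng.random() < p:
            w = x
            gw = full_grad(w)
        x = x_next
    return x


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [2.0, 3.9, 6.1]
    smoothness = max(ai * ai for ai in a)  # largest component curvature
    x_hat = lsvrg(a, b, eta=1.0 / (6.0 * smoothness), p=1.0 / len(a), iters=2000)
    print("estimate:", x_hat)
```

On this toy problem the exact minimizer is sum(a[i] * b[i]) / sum(a[i]**2), which gives a direct way to check the returned estimate.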

Error Compensated Loopless SVRG, Quartz, and SDCA for Distributed Optimization

Qian, Dong, Richtárik, et al.

2021

Preprint

The communication of gradients is a key bottleneck in distributed training of large scale machine learning models. In order to reduce the communication cost, gradient compression (e.g., sparsification and quantization) and error compensation techniques are often used. In this paper, we propose and study three new efficient methods in this space: error compensated loopless SVRG (EC-LSVRG), error compensated Quartz (EC-Quartz), and error compensated SDCA (EC-SDCA). Our methods are capable of working with any contraction compressor (e.g., TopK compressor), and we perform analysis for convex optimization problems in the composite case and smooth case for EC-LSVRG. We prove linear convergence rates for both cases and show that in the smooth case the rate has a better dependence on the parameter associated with the contraction compressor. Further, we show that in the smooth case, and under certain conditions, error compensated loopless SVRG has the same convergence rate as the vanilla loopless SVRG method. Then we show that the convergence rates of EC-Quartz and EC-SDCA in the composite case are as good as EC-LSVRG in the smooth case. Finally, numerical experiments are presented to illustrate the efficiency of our methods.
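The abstract above names TopK as an example of a contraction compressor. As a minimal sketch (the function names `top_k` and `sq_norm` are hypothetical, and this is not the authors' implementation), TopK keeps the k largest-magnitude entries of a d-dimensional vector and zeroes the rest. It satisfies the standard contraction bound ‖C(x) − x‖² ≤ (1 − k/d)‖x‖², which the example checks numerically.

```python
def top_k(x, k):
    """Keep the k largest-magnitude entries of x and set the others to zero."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def sq_norm(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(vi * vi for vi in v)


if __name__ == "__main__":
    x = [1.0, -3.0, 2.0, 0.5]
    k = 2
    cx = top_k(x, k)
    error = sq_norm([ci - xi for ci, xi in zip(cx, x)])
    bound = (1.0 - k / len(x)) * sq_norm(x)
    print("compressed:", cx)
    print("contraction bound holds:", error <= bound)
```

Only k values and their indices need to be communicated, which is where the savings in a distributed setting come from.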
