2019
DOI: 10.1137/17m1147846
Convergence Rate of Incremental Gradient and Incremental Newton Methods

Abstract: The incremental gradient method is a prominent algorithm for minimizing a finite sum of smooth convex functions, used in many contexts including large-scale data processing applications and distributed optimization over networks. It is a first-order method that processes the functions one at a time based on their gradient information. The incremental Newton method, on the other hand, is a second-order variant which exploits additionally the curvature information of the underlying functions and can therefore be…
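To make the abstract's description concrete, the sketch below implements, in Python, a plain incremental gradient pass and a simplified incremental Newton-type pass on assumed quadratic components. The data, the decaying step-size schedule, and the way curvature is accumulated are illustrative assumptions for this sketch, not the authors' exact algorithms.

import numpy as np

# Illustrative strongly convex quadratic components f_i(x) = 0.5 * x^T A_i x - b_i^T x;
# the data and schedules below are assumptions made for this sketch only.
rng = np.random.default_rng(0)
d, m = 5, 10
A = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(m)]
A = [0.5 * (Ai + Ai.T) + np.eye(d) for Ai in A]   # symmetrize and shift to keep each A_i positive definite
b = [rng.standard_normal(d) for _ in range(m)]

def grad_i(x, i):            # gradient of the i-th component
    return A[i] @ x - b[i]

def hess_i(x, i):            # Hessian of the i-th component (constant for quadratics)
    return A[i]

def incremental_gradient(x0, cycles, c=0.5):
    """First-order method: one gradient step per component, visited in a
    fixed cyclic order, with an assumed decaying step size alpha_k = c/k."""
    x, k = x0.copy(), 1
    for _ in range(cycles):
        for i in range(m):
            x -= (c / k) * grad_i(x, i)
            k += 1
    return x

def incremental_newton(x0, cycles):
    """Second-order variant: scale each component step by the inverse of the
    running sum of component Hessians, a simplified illustration of how
    curvature information can be exploited incrementally."""
    x = x0.copy()
    H = np.zeros((d, d))
    for _ in range(cycles):
        for i in range(m):
            H += hess_i(x, i)
            x -= np.linalg.solve(H, grad_i(x, i))
    return x

x_ig = incremental_gradient(np.zeros(d), cycles=200)
x_in = incremental_newton(np.zeros(d), cycles=200)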

Cited by 24 publications (35 citation statements)
References 39 publications (42 reference statements)
“…It analytically showed that, for strongly convex objective functions, the convergence rate under random reshuffling can be improved from O(1/i) in vanilla SGD [25] to O(1/i^2). The incremental gradient methods [26], [27], which can be viewed as the deterministic version of random reshuffling, share similar conclusions, i.e., random reshuffling helps accelerate the convergence rate from O(1/i) to O(1/i^2) under decaying step-sizes. Also, the work [24] establishes that random reshuffling will not degrade performance relative to the stochastic gradient descent implementation, provided the number of epochs is not too large.…”
Section: Motivation
confidence: 78%
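The O(1/i) versus O(1/i^2) comparison quoted above can be illustrated on a toy problem. The sketch below, written under assumed data, step-size constant, and epoch count (none of which come from the cited works), runs the same incremental update with three visiting orders: sampling with replacement, random reshuffling, and a fixed cyclic order, all with the decaying step size alpha_k = c/k.

import numpy as np

# Toy strongly convex finite sum f(x) = sum_i 0.5 * (a_i^T x - y_i)^2, used only to
# illustrate the rate comparison; the data, the step-size constant c, and the
# epoch count are assumptions made for this sketch.
rng = np.random.default_rng(1)
m, d = 50, 5
Adata = rng.standard_normal((m, d))
Adata /= np.linalg.norm(Adata, axis=1, keepdims=True)   # unit-norm rows keep single steps stable
y = rng.standard_normal(m)
x_star = np.linalg.lstsq(Adata, y, rcond=None)[0]       # minimizer of the finite sum

def grad(x, i):                                         # gradient of the i-th component
    return (Adata[i] @ x - y[i]) * Adata[i]

def run(order_fn, epochs, c=1.0):
    """Incremental updates x <- x - (c/k) * grad f_i(x); order_fn(epoch)
    returns the order in which the components are visited in that epoch."""
    x, k, errs = np.zeros(d), 1, []
    for e in range(epochs):
        for i in order_fn(e):
            x -= (c / k) * grad(x, i)
            k += 1
        errs.append(np.linalg.norm(x - x_star))
    return errs

with_replacement = run(lambda e: rng.integers(0, m, size=m), epochs=500)  # vanilla SGD-style sampling
reshuffled       = run(lambda e: rng.permutation(m), epochs=500)          # random reshuffling
cyclic           = run(lambda e: range(m), epochs=500)                    # deterministic incremental gradient

# The reshuffled and cyclic runs typically show a visibly faster decay of
# ||x - x*|| than sampling with replacement, in line with the rate discussion above.
print(with_replacement[-1], reshuffled[-1], cyclic[-1])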
“…To proceed, we will ignore the last two terms in (27) and consider the following approximate model, which we shall refer to as a long-term model.…”
Section: A. Error Dynamics
confidence: 99%
“…For example, in [13], another variance-reduction algorithm is proposed under reshuffling; however, no proof of convergence is provided. The closest attempts at proof are the useful arguments given in [14], [15], which deal with special problem formulations. The work [14] deals with the case of incremental aggregated gradients, which corresponds to a deterministic version of RR for SAG, while the work [15] deals with SVRG in the context of ridge regression problems using regret analysis.…”
Section: Introduction
confidence: 99%
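For context on the incremental aggregated gradient idea mentioned in connection with [14]: the method keeps the most recently computed gradient of every component in memory, refreshes one entry per iteration in a cyclic order, and steps along the average of the stored gradients. The sketch below illustrates this idea on assumed least-squares data; the step size, problem sizes, and names are illustrative and not taken from [14] or [15].

import numpy as np

# Illustrative least-squares components f_i(x) = 0.5 * (a_i^T x - y_i)^2;
# data, step size, and iteration count are assumptions for this sketch.
rng = np.random.default_rng(2)
m, d = 30, 4
A = rng.standard_normal((m, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)       # unit-norm rows for a safe constant step
y = rng.standard_normal(m)

def grad(x, i):                                     # gradient of the i-th component
    return (A[i] @ x - y[i]) * A[i]

def iag(x0, iters, alpha=0.1):
    """Incremental aggregated gradient sketch: refresh one stored component
    gradient per iteration (cyclic order) and move along the average of all
    stored gradients."""
    x = x0.copy()
    memory = np.array([grad(x, i) for i in range(m)])   # table of most recent component gradients
    for k in range(iters):
        i = k % m                                       # deterministic cyclic pick
        memory[i] = grad(x, i)                          # refresh this component's stored gradient
        x -= alpha * memory.mean(axis=0)                # step along the aggregated gradient
    return x

x_hat = iag(np.zeros(d), iters=5000)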