Sharp Analysis for Nonconvex SGD Escaping from Saddle Points

Fang, Cong; Lin, Zhouchen; Zhang, Tong

doi:10.48550/arxiv.1902.00247

Cited by 19 publications

(37 citation statements)

References 26 publications

“…When Assumption C holds, the communication can probably be improved by the factor of $\varepsilon^{-1/4}$ using techniques from Fang et al [2019], which achieve $\tilde{O}(\varepsilon^{-3.5})$ convergence rate under Assumption C outperforming $\tilde{O}(\varepsilon^{-4})$ from Jin et al [2021] by the factor of $\varepsilon^{-1/2}$. When balancing the terms in Theorems 3.3 and 3.4, the communication improvement will be the square root of this value.…”

Section: Discussion (mentioning)

confidence: 99%
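
The arithmetic behind the factors quoted in the excerpt above can be written out as follows. This is only a restatement of the quoted rates with logarithmic factors ignored, not a derivation taken from the cited paper.

```latex
\[
\frac{\varepsilon^{-4}}{\varepsilon^{-3.5}} = \varepsilon^{-1/2},
\qquad
\sqrt{\varepsilon^{-1/2}} = \varepsilon^{-1/4}.
\]
```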

“…There are also a number of algorithms designed for finite sum setting where $f(x) = \sum_{i=1}^{n} f_i(x)$ [Reddi et al, 2017, Allen-Zhu and Li, 2018, Fang et al, 2018], or in case when only stochastic gradients are available [Tripuraneni et al, 2018, Jin et al, 2021], including variance reduction techniques [Allen-Zhu, 2018, Fang et al, 2018]. The sharpest rates in these settings have been obtained by Fang et al [2018], Zhou and Gu [2019] and Fang et al [2019].…”

Section: Related Work (mentioning)

confidence: 99%

“…Compared with the uncompressed case, the total communication improves by $\varepsilon^{-1/4}$ when the stochastic gradient is Lipschitz and by $\sqrt{d}\,\varepsilon^{-3/4}$ otherwise (the sharpest results for SGD are by Fang et al [2019] and Jin et al [2021] respectively). In Theorem 1.1, we heavily rely on the following property of RandomK: when its randomness (i.e.…”

Section: Our Contributions (mentioning)

confidence: 99%

“…We use the following standard [Jin et al, 2021, Fang et al, 2019, Allen-Zhu, 2018, Zhou et al, 2018] assumptions about the objective function $f$: Assumption A. $f$ is $f_{\max}$-bounded, $L$-smooth and has $\rho$-Lipschitz Hessian, i.e. for all $x, y$:…”

Section: Preliminaries (mentioning)

confidence: 99%

Escaping Saddle Points with Compressed SGD

Avdiukhin, Yaroslavtsev

2021

Preprint

Stochastic gradient descent (SGD) is a prevalent optimization technique for large-scale distributed machine learning. While SGD computation can be efficiently divided between multiple machines, communication typically becomes a bottleneck in the distributed setting. Gradient compression methods can be used to alleviate this problem, and a recent line of work shows that SGD augmented with gradient compression converges to an ε-first-order stationary point. In this paper we extend these results to convergence to an ε-second-order stationary point (ε-SOSP), which is to the best of our knowledge the first result of this type. In addition, we show that, when the stochastic gradient is not Lipschitz, compressed SGD with RandomK compressor converges to an ε-SOSP with the same number of iterations as uncompressed SGD [Jin et al., 2021] (JACM), while improving the total communication by a factor of $\tilde{\Theta}(\sqrt{d}\,\varepsilon^{-3/4})$, where $d$ is the dimension of the optimization problem. We present additional results for the cases when the compressor is arbitrary and when the stochastic gradient is Lipschitz.
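
To make the RandomK compressor named in the abstract concrete, here is a minimal Python sketch of one compressed SGD step. The function names `random_k` and `compressed_sgd_step` are hypothetical, and the `d/k` rescaling is an assumed (commonly used) convention that makes the compressed gradient unbiased. The sketch omits any error feedback or perturbation the paper's algorithm may use, so it should not be read as the authors' method.

```python
import random

def random_k(grad, k, rng):
    """Keep k coordinates chosen uniformly at random and zero out the rest."""
    d = len(grad)
    kept = set(rng.sample(range(d), k))
    # Assumption: rescale by d/k so the compressed vector is unbiased in expectation.
    scale = d / k
    return [scale * g if i in kept else 0.0 for i, g in enumerate(grad)]

def compressed_sgd_step(x, stochastic_grad, step_size, k, rng):
    """One SGD step in which only k gradient coordinates are communicated."""
    g = random_k(stochastic_grad(x), k, rng)
    return [xi - step_size * gi for xi, gi in zip(x, g)]

rng = random.Random(0)
g = [1.0, -2.0, 3.0, -4.0, 5.0, -6.0]
c = random_k(g, 2, rng)  # only 2 of the 6 coordinates are nonzero

# With k equal to the dimension, nothing is dropped and the step is plain SGD.
x_new = compressed_sgd_step([1.0, 1.0], lambda x: list(x), 0.1, 2, rng)
```

Only the `k` surviving coordinates (and their indices) need to be sent per step, which is where the communication saving comes from.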

“…• Can we achieve the polynomial speedup in $\log n$ for more advanced stochastic optimization algorithms with complexity $\tilde{O}(\mathrm{poly}(\log n)/\epsilon^{3.5})$ [2,3,9,28,30] or $\tilde{O}(\mathrm{poly}(\log n)/\epsilon^{3})$ [8, 33]?…”

Section: Setting (mentioning)

confidence: 99%

Escape saddle points by a simple gradient-descent based algorithm

Zhang

2021

Preprint

Escaping saddle points is a central research topic in nonconvex optimization. In this paper, we propose a simple gradient-based algorithm such that for a smooth function $f\colon \mathbb{R}^n \to \mathbb{R}$, it outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations. Compared to the previous state-of-the-art algorithms by Jin et al. with $\tilde{O}(\log^4 n/\epsilon^{2})$ or $\tilde{O}(\log^6 n/\epsilon^{1.75})$ iterations, our algorithm is polynomially better in terms of $\log n$ and matches their complexities in terms of $1/\epsilon$. For the stochastic setting, our algorithm outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log^2 n/\epsilon^{4})$ iterations. Technically, our main contribution is an idea of implementing a robust Hessian power method using only gradients, which can find negative curvature near saddle points and achieve the polynomial speedup in $\log n$ compared to the perturbed gradient descent methods. Finally, we also perform numerical experiments that support our results.
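
The idea of a Hessian power method that uses only gradients can be illustrated with a generic sketch. It approximates Hessian-vector products by finite differences of gradients and runs power iteration on $I - H/L$. This is a textbook construction under stated assumptions, not the robust variant the abstract refers to. The toy objective, the helper names, and the parameters `r`, `iters`, and `smoothness` are all hypothetical.

```python
import math
import random

def grad(x):
    # Gradient of the toy saddle f(x) = 0.5*x0^2 - 0.25*x1^2, whose Hessian is diag(1, -0.5).
    return [x[0], -0.5 * x[1]]

def hvp(grad_fn, x, v, r=1e-4):
    # Finite-difference Hessian-vector product: H v ~ (grad(x + r*v) - grad(x)) / r.
    g0 = grad_fn(x)
    g1 = grad_fn([xi + r * vi for xi, vi in zip(x, v)])
    return [(a - b) / r for a, b in zip(g1, g0)]

def negative_curvature_direction(grad_fn, x, smoothness, iters=200, seed=0):
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in x]
    for _ in range(iters):
        hv = hvp(grad_fn, x, v)
        # Power step on (I - H/L): its top eigenvector is the most negative curvature direction of H.
        v = [vi - hvi / smoothness for vi, hvi in zip(v, hv)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / norm for vi in v]
    # Rayleigh quotient v^T H v estimates the curvature along v.
    curvature = sum(vi * hvi for vi, hvi in zip(v, hvp(grad_fn, x, v)))
    return v, curvature

direction, curvature = negative_curvature_direction(grad, [0.0, 0.0], smoothness=1.0)
```

On this toy problem the returned direction aligns with the second coordinate axis, along which the function curves downward, and no Hessian matrix is ever formed.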

Distributed Stochastic Gradient Method for Non-Convex Problems with Applications in Supervised Learning

George, Yang, Bai, et al.

2019

2019 IEEE 58th Conference on Decision and Control (CDC)

We develop a distributed stochastic gradient descent algorithm for solving non-convex optimization problems under the assumption that the local objective functions are twice continuously differentiable with Lipschitz continuous gradients and Hessians. We provide sufficient conditions on step-sizes that guarantee the asymptotic mean-square convergence of the proposed algorithm. We apply the developed algorithm to a distributed supervised-learning problem, in which a set of networked agents collaboratively train their individual neural nets to recognize handwritten digits in images. Results indicate that all agents report similar performance that is also comparable to the performance of a centrally trained neural net. Numerical results also show that the proposed distributed algorithm allows the individual agents to recognize the digits even though the training data corresponding to all the digits is not locally available to each agent.
