2017
DOI: 10.48550/arxiv.1711.10456
Preprint

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

Abstract: Nesterov's accelerated gradient descent (AGD), an instance of the general family of "momentum methods," provably achieves a faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains open. This paper studies a simple variant of AGD, and shows that it escapes saddle points and finds a second-order stationary point in Õ(1/ε^{7/4}) iterations, faster than the Õ(1/ε^2) iterations required by GD. To the best of our knowl…
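The variant of AGD studied in the paper combines Nesterov-style momentum with occasional random perturbations (and, in the paper's full algorithm, a negative-curvature-exploitation step). As a rough illustration only, not the authors' exact algorithm, a minimal sketch of accelerated gradient descent with a perturbation when the gradient becomes small might look like this; the step size eta, momentum parameter theta, threshold g_thresh, and perturbation radius r are placeholder assumptions, not the constants from the paper's analysis:

```python
import numpy as np

def perturbed_agd(grad, x0, eta=1e-3, theta=0.9, g_thresh=1e-3, r=1e-3,
                  max_iters=10_000, seed=0):
    """Illustrative sketch only: Nesterov-style momentum steps plus a small
    random perturbation whenever the gradient is small, so the iterate can
    drift off strict saddle points. Not the paper's exact PAGD algorithm,
    which also uses a negative-curvature-exploitation step."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)                  # momentum (velocity) term
    for _ in range(max_iters):
        y = x + theta * v                 # look-ahead point
        g = grad(y)
        if np.linalg.norm(g) <= g_thresh:
            # Near-stationary: add a small random perturbation and
            # restart the momentum, then keep iterating.
            xi = rng.standard_normal(x.shape)
            x = x + r * xi / (np.linalg.norm(xi) + 1e-12)
            v = np.zeros_like(x)
            continue
        x_next = y - eta * g              # gradient step from the look-ahead
        v = x_next - x                    # momentum update
        x = x_next
    return x
```

For instance, `perturbed_agd(lambda x: 2 * x, np.ones(5))` drives the simple quadratic ‖x‖^2 toward its minimizer; the point of the paper is that, with carefully chosen parameters, this kind of accelerated scheme reaches a second-order stationary point in fewer iterations than plain gradient descent.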

Cited by 61 publications (79 citation statements) | References 9 publications
“…Our main contribution is a simple, single-loop, and robust gradient-based algorithm that can find an ε-approximate second-order stationary point of a smooth, Hessian-Lipschitz function f : R^n → R. Compared to previous works [3,24,29] exploiting the idea of the gradient-based Hessian power method, our algorithm has a single-looped, simpler structure and better numerical stability. Compared to the previous state-of-the-art results with single-looped structures by [21] and [19,20] using Õ(log^6 n / ε^{1.75}) or Õ(log^4 n / ε^2) iterations, our algorithm achieves a polynomial speedup in log n: Theorem 1 (informal). Our single-looped algorithm finds an ε-approximate second-order stationary point in Õ(log n / ε^{1.75}) iterations.…”
Section: Introduction (mentioning)
confidence: 89%
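Both this citation statement and the next one revolve around finding an ε-approximate second-order stationary point of a smooth, ρ-Hessian-Lipschitz function, i.e. a point with small gradient and nearly positive-semidefinite Hessian. For reference, a minimal check of that standard condition might read as follows; the explicit Hessian here is purely for illustration, since the gradient-based methods discussed in these statements certify the condition without ever forming the Hessian:

```python
import numpy as np

def is_approx_second_order_stationary(grad, hess, x, eps, rho):
    """Check the standard definition of an eps-approximate second-order
    stationary point of a rho-Hessian-Lipschitz function:
        ||grad f(x)|| <= eps  and  lambda_min(hess f(x)) >= -sqrt(rho * eps).
    Illustration only: forming the Hessian is exactly what the cited
    gradient-based algorithms avoid."""
    g = grad(x)
    H = hess(x)
    lam_min = np.linalg.eigvalsh(H).min()   # smallest Hessian eigenvalue
    return np.linalg.norm(g) <= eps and lam_min >= -np.sqrt(rho * eps)
```

For the quadratic f(x) = ‖x‖^2, `is_approx_second_order_stationary(lambda x: 2 * x, lambda x: 2 * np.eye(x.size), np.zeros(3), eps=1e-2, rho=1.0)` returns True, since the gradient vanishes and the Hessian is positive definite at the origin.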
“…A seminal work along this line was by Ge et al. [11], which found an ε-approximate second-order stationary point satisfying (1) using only gradients in O(poly(n, 1/ε)) iterations. This was later improved to an almost dimension-free Õ(log^4 n / ε^2) in the follow-up work [19], and the perturbed accelerated gradient descent algorithm [21] based on Nesterov's accelerated gradient descent [26] takes Õ(log^6 n / ε^{1.75}) iterations. However, these results still suffer from a significant overhead in terms of log n. In the other direction, Refs.…”
Section: Introduction (mentioning)
confidence: 99%
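The perturbed gradient descent idea referenced in this statement takes plain gradient steps and, once the gradient becomes small, adds a single random perturbation before continuing, so the iterates drift away from strict saddle points. A minimal sketch under assumed parameter names (eta, g_thresh, r, and t_wait are placeholders, not the tuned constants of the cited analyses):

```python
import numpy as np

def perturbed_gd(grad, x0, eta=1e-3, g_thresh=1e-3, r=1e-3,
                 t_wait=100, max_iters=10_000, seed=0):
    """Sketch of perturbed gradient descent: standard gradient steps, plus a
    random perturbation whenever the gradient is small and no perturbation
    was added in the last `t_wait` iterations. Parameter values here are
    placeholders, not the constants from the cited papers."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    last_perturb = -t_wait
    for t in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) <= g_thresh and t - last_perturb >= t_wait:
            # Near-stationary: perturb within a small ball around x.
            xi = rng.standard_normal(x.shape)
            x = x + r * xi / (np.linalg.norm(xi) + 1e-12)
            last_perturb = t
        else:
            x = x - eta * g   # standard gradient step
    return x
```

The perturbed accelerated variant discussed in the statement replaces the plain gradient step with a momentum step, which is how the Õ(log^6 n / ε^{1.75}) iteration count improves on the Õ(log^4 n / ε^2) rate of the non-accelerated scheme.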