2019
DOI: 10.48550/arxiv.1902.04811
Preprint

On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points

Chi Jin,
Praneeth Netrapalli,
Rong Ge
et al.

Abstract: Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning. While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in machine learning have involved nonconvex optimization, and a gap has arisen between theory and practice. Indeed, traditional analyses of GD and SGD show that both algorithms converge to stationary points efficiently. But these analyses do not take into account th…
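The abstract is truncated above. As context for the saddle-point discussion, here is a minimal sketch of the perturbed-gradient-descent idea this line of work studies: inject a small random perturbation whenever the gradient is nearly zero, so the iterate can escape strict saddle points. This is an illustration only, with assumed step size, noise radius, and toy objective, not the authors' exact algorithm.

```python
# Minimal sketch of perturbed gradient descent (illustrative only, not the paper's
# exact algorithm or step-size schedule). Assumption: `grad` returns the gradient
# of a smooth nonconvex objective at x.
import numpy as np

def perturbed_gd(grad, x0, eta=1e-2, eps=1e-3, radius=1e-2, max_iters=5000, seed=0):
    """Gradient descent that injects a small random perturbation whenever the
    gradient is nearly zero, so iterates can escape strict saddle points."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            # Near a first-order stationary point: perturb inside a small ball.
            u = rng.normal(size=x.shape)
            x = x + radius * u / np.linalg.norm(u)
        else:
            x = x - eta * g
    return x

# Toy example: f(x, y) = x^2 + y^4/4 - y^2/2 has a strict saddle at the origin
# and minima at (0, +1) and (0, -1).
toy_grad = lambda z: np.array([2.0 * z[0], z[1] ** 3 - z[1]])
print(perturbed_gd(toy_grad, np.zeros(2)))  # ends up near one of the minima
```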

Cited by 21 publications (65 citation statements)
References 20 publications (36 reference statements)
“…Our main contribution is a simple, single-loop, and robust gradient-based algorithm that can find an ε-approximate second-order stationary point of a smooth, Hessian Lipschitz function f : R^n → R. Compared to previous works [3,24,29] exploiting the idea of gradient-based Hessian power method, our algorithm has a single-looped, simpler structure and better numerical stability. Compared to the previous state-of-the-art results with single-looped structures by [21] and [19,20] using Õ(log^6 n / ε^{1.75}) or Õ(log^4 n / ε^2) iterations, our algorithm achieves a polynomial speedup in log n: Theorem 1 (informal). Our single-looped algorithm finds an ε-approximate second-order stationary point in Õ(log n / ε^{1.75}) iterations.…”
Section: Introduction (mentioning)
confidence: 89%
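For reference, the ε-approximate second-order stationary points discussed in this statement are usually defined, for a ρ-Hessian-Lipschitz function f, by the two conditions below. This is the standard definition supplied for context, not text from the citing paper.

```latex
% Standard definition of an \epsilon-approximate second-order stationary point
% of a \rho-Hessian-Lipschitz function f (supplied for context).
\[
\|\nabla f(x)\| \le \epsilon
\qquad\text{and}\qquad
\lambda_{\min}\!\left(\nabla^2 f(x)\right) \ge -\sqrt{\rho\,\epsilon}.
\]
```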
“…We further assume that the stochastic gradients are Lipschitz (or equivalently, the underlying functions are gradient-Lipschitz, see Assumption 2), which is also adopted in most of the existing works; see e.g. [8,19,20,34]. We demonstrate that a simple extended version of our algorithm takes O(log^2 n) iterations to detect a negative curvature direction using only stochastic gradients, and then obtain an Ω(1) function value decrease with high probability.…”
Section: Introduction (mentioning)
confidence: 97%
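The negative-curvature detection mentioned here, like the gradient-based Hessian power method referenced in the first statement, rests on approximating Hessian-vector products with finite differences of gradients. Below is a minimal sketch of that building block, using exact gradients for simplicity; the function names, shift constant L, and iteration count are illustrative assumptions, not code from either paper.

```python
# Sketch of gradient-based negative-curvature detection: power iteration on the
# shifted Hessian (L*I - H), using only gradient evaluations. Illustrative only;
# `grad`, `L`, and the iteration count are assumptions, not the papers' choices.
import numpy as np

def hvp(grad, x, v, delta=1e-5):
    """Approximate the Hessian-vector product H(x) v by a finite difference of gradients."""
    return (grad(x + delta * v) - grad(x - delta * v)) / (2.0 * delta)

def negative_curvature_direction(grad, x, L=10.0, iters=200, seed=0):
    """Power iteration on (L*I - H(x)): when L upper-bounds the spectral norm of
    H(x) (e.g. the gradient-Lipschitz constant), its top eigenvector is the
    direction of most negative curvature of H(x)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=x.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = L * v - hvp(grad, x, v)
        v /= np.linalg.norm(v)
    curvature = v @ hvp(grad, x, v)  # Rayleigh quotient along the returned direction
    return v, curvature

# At the saddle (0, 0) of f(x, y) = x^2 - y^2, the negative curvature direction is ±e_2.
g = lambda z: np.array([2.0 * z[0], -2.0 * z[1]])
v, curv = negative_curvature_direction(g, np.zeros(2))
print(v, curv)  # roughly (0, ±1) with curvature close to -2
```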
“…Motivated by recent work on escaping saddle points (Ge et al., 2015; Lee et al., 2016; Jin et al., 2019), one can show that the SSGD algorithm equipped with the aforementioned artificial noise injection escapes from all saddle points, and hence the initialization condition (14) can be dropped. First, we generalize Assumption 2.1 for local convergence to the following for global convergence:…”
Section: Global Convergence Analysis (mentioning)
confidence: 99%