2019
DOI: 10.1137/17m1113898
Gradient Descent Finds the Cubic-Regularized Nonconvex Newton Step

Abstract: We consider the minimization of non-convex quadratic forms regularized by a cubic term, which exhibit multiple saddle points and poor local minima. Nonetheless, we prove that, under mild assumptions, gradient descent approximates the global minimum to within ε accuracy in O(ε^{-1} log(1/ε)) steps for large ε and O(log(1/ε)) steps for small ε (compared to a condition number we define), with at most logarithmic dependence on the problem dimension. When we use gradient descent to approximate the Nesterov-Polyak cub…

Cited by 62 publications (99 citation statements)
References 27 publications
“…A direct consequence of Assumption 1 is that for any x ∈ F, it holds that f(p_σ(x)) ≤ f_σ(x) whenever σ ≥ L (see [24, Lemma 4]). This further implies that for all k ≥ 0, we can find a σ_k ≤ 2L such that (8) holds. Indeed, if the Lipschitz constant L is known, we can let σ_k = L. If not, by using a line search strategy that doubles σ_k after each trial [24, Section 5.2], we can find a σ_k ≤ 2L such that (8) holds.…”
Section: The Cubic Regularization Method (mentioning, confidence: 91%)
“…This observation has led to the development of various efficient algorithms for finding p_σ(x) in [10]. More recently, it was shown in [8] that the gradient descent method can also be applied to find p_σ(x). For the global convergence of the CR method, we need the following assumption.…”
Section: The Cubic Regularization Method (mentioning, confidence: 99%)
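Applying gradient descent to the cubic model, as in [8], amounts to iterating on its explicit gradient. The sketch below is a minimal illustration under one common scaling convention (σ/3 on the cube term; others use σ/6), and it does not reproduce the step-size and initialization prescriptions of [8].

```python
import numpy as np

def cubic_model_gd(g, H, sigma, lr=0.01, iters=20000, tol=1e-10):
    """Plain gradient descent on the cubic-regularized model

        m(s) = g.T s + 0.5 s.T H s + (sigma / 3) * ||s||^3,

    whose gradient is g + H s + sigma * ||s|| * s. Illustrative
    sketch only: [8] additionally prescribes a particular step size
    and a perturbed initialization. Starting from s = 0, the first
    step here moves along -g.
    """
    s = np.zeros_like(g, dtype=float)
    for _ in range(iters):
        grad = g + H @ s + sigma * np.linalg.norm(s) * s
        if np.linalg.norm(grad) < tol:
            break
        s = s - lr * grad
    return s
```

Even when H has negative eigenvalues, the cubic term makes m coercive, so the iteration stays bounded for a small enough step size.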
“…Here, L_2 is taken to be the Lipschitz constant of the Hessians (see Definition 10), so as to ensure that the objective function in (158) majorizes the true objective f(·). While the subproblem (158) is nonconvex and may have local minima, it can often be efficiently solved by minimizing an explicitly written univariate convex function [185, Section 5], or even by gradient descent [188].…”
Section: Hessian-based Algorithms (mentioning, confidence: 99%)
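The univariate reformulation alluded to above can be sketched as scalar root-finding: at a global minimizer, s = -(H + λI)^{-1} g with λ = σ‖s‖ and H + λI positive semidefinite, so one can bisect on λ. This sketch glosses over the degenerate "hard case" (g orthogonal to H's bottom eigenvector) and again assumes the σ/3 scaling convention.

```python
import numpy as np

def solve_cubic_subproblem(g, H, sigma, bisections=200):
    """Solve min_s  g.T s + 0.5 s.T H s + (sigma/3) * ||s||^3  via a
    scalar equation (illustrative sketch; the "hard case" is ignored).
    Over lam >= max(0, -lambda_min(H)), the map
    lam -> ||(H + lam I)^{-1} g|| is decreasing while lam/sigma is
    increasing, so the equation ||s(lam)|| = lam/sigma has a unique
    root found by bisection.
    """
    n = len(g)
    lam_min = np.linalg.eigvalsh(H)[0]
    lo = max(0.0, -lam_min) + 1e-12

    def norm_s(lam):
        return np.linalg.norm(np.linalg.solve(H + lam * np.eye(n), -g))

    hi = lo + 1.0
    while norm_s(hi) > hi / sigma:  # grow bracket until the sign flips
        hi *= 2.0
    for _ in range(bisections):
        mid = 0.5 * (lo + hi)
        if norm_s(mid) > mid / sigma:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return np.linalg.solve(H + lam * np.eye(n), -g)
```

Each bisection step costs one linear solve; in practice a single factorization of H (e.g., tridiagonalization) makes all evaluations of ‖s(λ)‖ cheap.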
“…where σ_t is the cubic regularization parameter chosen for the current iteration. As in the case of TR, the major bottleneck of CR involves solving the sub-problem (2b), for which various techniques have been proposed, e.g., [1,4,8,9]. To the best of our knowledge, the use of such regularization was first introduced in the pioneering work of [34], and subsequently further studied in the seminal works of [9,10,45]. From the worst-case complexity point of view, CR has a better dependence on ε_g compared to TR.…”
Section: Cubic Regularization (mentioning, confidence: 99%)