2022
DOI: 10.1007/s10107-022-01913-5
Hessian averaging in stochastic Newton methods achieves superlinear convergence

Abstract: We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as Subsampled Newton and Newton Sketch, which can efficiently construct stochastic Hessian estimates for many tasks, e.g., training machine learning models. Despite using second-order information, these existing methods do not exhibit superlinear convergence…
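The abstract describes a stochastic Newton iteration in which noisy Hessian estimates are averaged across iterations. Below is a minimal numerical sketch of that averaging idea, assuming a generic `hess_oracle` that returns a symmetric stochastic Hessian estimate and using a plain uniform average with a unit step size; this is an illustration of the mechanism, not the paper's exact algorithm or step-size rule.

```python
import numpy as np

def averaged_newton(grad, hess_oracle, x0, steps=50, reg=1e-8):
    """Stochastic Newton iteration with uniform Hessian averaging (sketch).

    grad(x)        -- gradient of the objective at x
    hess_oracle(x) -- stochastic, symmetric estimate of the Hessian at x
    The running average H_bar_t = (1/t) * sum_{s<=t} H_hat_s damps the noise
    of the individual estimates; the Newton step is taken with H_bar_t.
    """
    x = np.asarray(x0, dtype=float)
    d = x.size
    H_bar = np.zeros((d, d))
    for t in range(1, steps + 1):
        H_hat = hess_oracle(x)              # fresh stochastic Hessian estimate
        H_bar += (H_hat - H_bar) / t        # uniform (equal-weight) average
        x = x - np.linalg.solve(H_bar + reg * np.eye(d), grad(x))
    return x

# Toy usage: strongly convex quadratic with a noisy Hessian oracle.
rng = np.random.default_rng(0)
A = np.diag([2.0, 4.0, 9.0])

def noisy_hessian(x):
    Z = 0.1 * rng.standard_normal((3, 3))
    return A + (Z + Z.T) / 2                # symmetric noise around the true Hessian

x_final = averaged_newton(lambda x: A @ x, noisy_hessian, x0=np.ones(3))
```

Because the averaged estimate concentrates around the true Hessian as t grows, later iterations behave increasingly like exact Newton steps, which is the intuition behind the superlinear rates studied in the paper.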

Cited by 3 publications (2 citation statements)
References 41 publications
“…Assumption 3.2(1) ensures that the stochastic gradient ∇ℓ(x; ξ) is an unbiased estimator of the gradient ∇f(x) for all x ∈ R^n. Assumption 3.2(2) provides a constant upper bound on the norm of an element of ∂r(x) for all x ∈ R^n, which exists when r is the weighted ℓ1-norm or a weighted group norm. One could relax Assumption 3.2(3) by assuming that the stochastic gradient error has a sub-exponential tail, e.g., Na et al. (2022), which we leave as future work. Assumption 3.2(4) is implied by the following two, perhaps more natural, assumptions: (i) there exists a constant c_e > 0 such that, for all k, it holds that P{‖d_k − ∇f(x_k)‖ ≤ c_e | F_k} = 1, i.e., the error in the stochastic gradient estimator d_k is almost surely bounded; and (ii) there exists a constant c_α such that, for a given α > 0 and all k ≥ 1, it holds that P{χ(x_k; α) ≤ c_α | F_k} = 1 (also see (3)), i.e., the optimality measure is almost surely bounded.…”
Section: Assumptions (mentioning)
Confidence: 99%
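For readability, the two sufficient conditions (i) and (ii) quoted above can be typeset as follows; this is only a restatement in the citing paper's notation (d_k, F_k, χ(x_k; α)), not an addition to it.

```latex
% Conditions (i) and (ii) implying Assumption 3.2(4), as quoted above.
\begin{align*}
\text{(i)}  \;\; & \exists\, c_e > 0 :\;
  \mathbb{P}\bigl\{\,\lVert d_k - \nabla f(x_k)\rVert \le c_e \mid \mathcal{F}_k\,\bigr\} = 1
  \quad \text{for all } k, \\
\text{(ii)} \;\; & \exists\, c_\alpha :\;
  \mathbb{P}\bigl\{\,\chi(x_k;\alpha) \le c_\alpha \mid \mathcal{F}_k\,\bigr\} = 1
  \quad \text{for a given } \alpha > 0 \text{ and all } k \ge 1 .
\end{align*}
```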
“…Standard RandNLA guarantees such as the subspace embedding are sufficient (although not necessary) to ensure that Ĥ_t provides a good enough approximation to enable accelerated local convergence in time Õ(nd). These approaches have also been extended to distributed settings via RMT-based model averaging, with applications in ensemble methods, distributed optimization, and federated learning [110,109,144,48,92]. Further RandNLA-based Newton-type methods include: Subsampled Newton [75,159,18,17]; Hessian approximations via randomized Taylor expansion [1] and low-rank approximation [77,55]; Hessian diagonal/trace estimates via Hutchinson's method [136] and Stochastic Lanczos Quadrature, particularly for non-convex problems, e.g., PyHessian [176], AdaHessian [177]; and finally Stochastic Quasi-Newton type methods [106,137].…”
Section: Hessian Sketch (mentioning)
Confidence: 99%
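The passage above surveys several families of randomized Hessian estimators. As a concrete illustration, here is a small sketch of two of them, assuming a generalized linear model objective f(x) = (1/n) * Σ_i φ(a_i^T x) so that the Hessian is (1/n) * A^T diag(φ''(Ax)) A; the function names and the sampling scheme are illustrative textbook versions, not the specific algorithms of the cited works.

```python
import numpy as np

def subsampled_hessian(A, phi_dd, x, batch, rng):
    """Subsampled Newton estimate: average the per-sample Hessians over a
    uniformly sampled row subset S, i.e. (1/|S|) * A_S.T @ diag(phi''(A_S x)) @ A_S."""
    idx = rng.choice(A.shape[0], size=batch, replace=False)
    A_S = A[idx]
    w = phi_dd(A_S @ x)                     # per-sample second derivatives
    return (A_S * w[:, None]).T @ A_S / batch

def hutchinson_diag(hvp, d, probes, rng):
    """Hutchinson-style diagonal estimate using only Hessian-vector products:
    E[z * (H z)] = diag(H) when the entries of z are i.i.d. Rademacher."""
    est = np.zeros(d)
    for _ in range(probes):
        z = rng.choice([-1.0, 1.0], size=d)
        est += z * hvp(z)
    return est / probes

# Toy usage with a least-squares loss, where phi(t) = t^2 / 2 and phi'' == 1.
rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 20))
x = rng.standard_normal(20)
H_hat = subsampled_hessian(A, lambda t: np.ones_like(t), x, batch=100, rng=rng)
diag_hat = hutchinson_diag(lambda z: H_hat @ z, d=20, probes=200, rng=rng)
```

Sketch-based estimators (Newton Sketch) and stochastic quasi-Newton updates follow the same pattern of replacing the exact Hessian with a cheaper surrogate inside the Newton-type step.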