2022
DOI: 10.1007/s10107-022-01913-5
Hessian averaging in stochastic Newton methods achieves superlinear convergence

Abstract: We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as Subsampled Newton and Newton Sketch, which can efficiently construct stochastic Hessian estimates for many tasks, e.g., training machine learning models. Despite using second-order information, these existing methods do not exhibit superlinear convergence…
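The abstract describes a stochastic Newton iteration in which noisy Hessian estimates are averaged across iterations. Below is a minimal numerical sketch of that averaging idea, assuming a generic `hess_oracle` that returns a symmetric stochastic Hessian estimate and using a plain uniform average with a unit step size; this is an illustration of the mechanism, not the paper's exact algorithm or step-size rule.

```python
import numpy as np

def averaged_newton(grad, hess_oracle, x0, steps=50, reg=1e-8):
    """Stochastic Newton iteration with uniform Hessian averaging (sketch).

    grad(x)        -- gradient of the objective at x
    hess_oracle(x) -- stochastic, symmetric estimate of the Hessian at x
    The running average H_bar_t = (1/t) * sum_{s<=t} H_hat_s damps the noise
    of the individual estimates; the Newton step is taken with H_bar_t.
    """
    x = np.asarray(x0, dtype=float)
    d = x.size
    H_bar = np.zeros((d, d))
    for t in range(1, steps + 1):
        H_hat = hess_oracle(x)              # fresh stochastic Hessian estimate
        H_bar += (H_hat - H_bar) / t        # uniform (equal-weight) average
        x = x - np.linalg.solve(H_bar + reg * np.eye(d), grad(x))
    return x

# Toy usage: strongly convex quadratic with a noisy Hessian oracle.
rng = np.random.default_rng(0)
A = np.diag([2.0, 4.0, 9.0])

def noisy_hessian(x):
    Z = 0.1 * rng.standard_normal((3, 3))
    return A + (Z + Z.T) / 2                # symmetric noise around the true Hessian

x_final = averaged_newton(lambda x: A @ x, noisy_hessian, x0=np.ones(3))
```

Because the averaged estimate concentrates around the true Hessian as t grows, later iterations behave increasingly like exact Newton steps, which is the intuition behind the superlinear rates studied in the paper.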

Cited by 3 publications (2 citation statements)
References 41 publications
“…Assumption 3.2(1) ensures that the stochastic gradient ∇ℓ(x; ξ) is an unbiased estimator of the gradient ∇f(x) for all x ∈ R^n. Assumption 3.2(2) provides a constant upper bound on the norm of an element of ∂r(x) for all x ∈ R^n, which exists when r is the weighted ℓ1-norm or a weighted group norm. One could relax Assumption 3.2(3) by assuming that the stochastic gradient error has a sub-exponential tail, e.g., Na et al. (2022), which we leave as future work. Assumption 3.2(4) is implied by the following two, perhaps more natural, assumptions: (i) there exists a constant c_e > 0 such that, for all k, it holds that P{‖d_k − ∇f(x_k)‖ ≤ c_e | F_k} = 1, i.e., the error in the stochastic gradient estimator d_k is almost surely bounded; and (ii) there exists a constant c_α such that, for a given α > 0 and all k ≥ 1, it holds that P{χ(x_k; α) ≤ c_α | F_k} = 1 (also see (3)), i.e., the optimality measure is almost surely bounded.…”
Section: Assumptions (mentioning)
Confidence: 99%
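For readability, the two sufficient conditions (i) and (ii) quoted above can be typeset as follows; this is only a restatement in the citing paper's notation (d_k, F_k, χ(x_k; α)), not an addition to it.

```latex
% Conditions (i) and (ii) implying Assumption 3.2(4), as quoted above.
\begin{align*}
\text{(i)}  \;\; & \exists\, c_e > 0 :\;
  \mathbb{P}\bigl\{\,\lVert d_k - \nabla f(x_k)\rVert \le c_e \mid \mathcal{F}_k\,\bigr\} = 1
  \quad \text{for all } k, \\
\text{(ii)} \;\; & \exists\, c_\alpha :\;
  \mathbb{P}\bigl\{\,\chi(x_k;\alpha) \le c_\alpha \mid \mathcal{F}_k\,\bigr\} = 1
  \quad \text{for a given } \alpha > 0 \text{ and all } k \ge 1 .
\end{align*}
```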
“…Standard RandNLA guarantees such as the subspace embedding are sufficient (although not necessary) to ensure that Ĥ_t provides a good enough approximation to enable accelerated local convergence in time Õ(nd). These approaches have also been extended to distributed settings via RMT-based model averaging, with applications in ensemble methods, distributed optimization, and federated learning [110,109,144,48,92]. Further RandNLA-based Newton-type methods include: Subsampled Newton [75,159,18,17]; Hessian approximations via randomized Taylor expansion [1] and low-rank approximation [77,55]; Hessian diagonal/trace estimates via Hutchinson's method [136] and Stochastic Lanczos Quadrature, particularly for non-convex problems, e.g., PyHessian [176], AdaHessian [177]; and finally Stochastic Quasi-Newton type methods [106,137].…”
Section: Hessian Sketch (mentioning)
Confidence: 99%
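The passage above surveys several families of randomized Hessian estimators. As a concrete illustration, here is a small sketch of two of them, assuming a generalized linear model objective f(x) = (1/n) * Σ_i φ(a_i^T x) so that the Hessian is (1/n) * A^T diag(φ''(Ax)) A; the function names and the sampling scheme are illustrative textbook versions, not the specific algorithms of the cited works.

```python
import numpy as np

def subsampled_hessian(A, phi_dd, x, batch, rng):
    """Subsampled Newton estimate: average the per-sample Hessians over a
    uniformly sampled row subset S, i.e. (1/|S|) * A_S.T @ diag(phi''(A_S x)) @ A_S."""
    idx = rng.choice(A.shape[0], size=batch, replace=False)
    A_S = A[idx]
    w = phi_dd(A_S @ x)                     # per-sample second derivatives
    return (A_S * w[:, None]).T @ A_S / batch

def hutchinson_diag(hvp, d, probes, rng):
    """Hutchinson-style diagonal estimate using only Hessian-vector products:
    E[z * (H z)] = diag(H) when the entries of z are i.i.d. Rademacher."""
    est = np.zeros(d)
    for _ in range(probes):
        z = rng.choice([-1.0, 1.0], size=d)
        est += z * hvp(z)
    return est / probes

# Toy usage with a least-squares loss, where phi(t) = t^2 / 2 and phi'' == 1.
rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 20))
x = rng.standard_normal(20)
H_hat = subsampled_hessian(A, lambda t: np.ones_like(t), x, batch=100, rng=rng)
diag_hat = hutchinson_diag(lambda z: H_hat @ z, d=20, probes=200, rng=rng)
```

Sketch-based estimators (Newton Sketch) and stochastic quasi-Newton updates follow the same pattern of replacing the exact Hessian with a cheaper surrogate inside the Newton-type step.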