2011
DOI: 10.1137/10079923x

On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning

Abstract: This paper describes how to incorporate sampled curvature information in a Newton-CG method and in a limited memory quasi-Newton method for statistical learning. The motivation for this work stems from supervised machine learning applications involving a very large number of training points. We follow a batch approach, also known in the stochastic optimization literature as a sample average approximation (SAA) approach. Curvature information is incorporated in two sub-sampled Hessian algorithms, one based on a…
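The sub-sampling idea in the abstract can be illustrated with a minimal sketch (not the authors' code): for a generic binary logistic-regression loss, chosen here only for concreteness, the gradient uses the full training set while the Hessian-vector products needed by CG use only a small random subsample. All function names, the damping constant, the fixed step length, and the CG iteration limit below are illustrative assumptions.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss, labels y in {-1, +1}."""
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))        # sigma(-y * x^T w)
    return -(X.T @ (y * s)) / len(y)

def subsampled_hess_vec(w, v, X, idx, damping=1e-3):
    """Hessian-vector product estimated on the subsample `idx` only."""
    Xs = X[idx]
    p = 1.0 / (1.0 + np.exp(-(Xs @ w)))
    d = p * (1.0 - p)                            # per-example curvature weights
    return Xs.T @ (d * (Xs @ v)) / len(idx) + damping * v

def cg(matvec, b, max_iter=10, tol=1e-8):
    """Plain conjugate gradient: needs only matrix-vector products, never the matrix."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def subsampled_newton_cg(X, y, hess_sample=64, outer_iters=20, step=1.0, seed=0):
    """Outer loop: full-batch gradient, sub-sampled curvature, CG for the step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(outer_iters):
        g = logistic_grad(w, X, y)               # gradient: all training points
        idx = rng.choice(len(y), size=min(hess_sample, len(y)), replace=False)
        d = cg(lambda v: subsampled_hess_vec(w, v, X, idx), -g)
        w = w + step * d                         # a line search would go here
    return w

# Toy usage on synthetic data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20))
y = np.sign(X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000))
w_hat = subsampled_newton_cg(X, y)
```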

Cited by 192 publications (178 citation statements)
References 9 publications
“…Martens (2010) recommended using this technique (as does Byrd et al (2011)), and in our experience it can often improve optimization speed if used carefully and in the right contexts. But despite this, there are several good theoretical reasons why it might be better, at least in some situations, to use the same mini-batch to compute gradient and curvature matrix-vector products.…”
Section: Higher Quality Gradient Estimates
Mentioning confidence: 97%
“…And in practice, we have found that such an approach does not seem to work very well, and results in CG itself diverging in some cases. The solution advocated by Martens (2010) and independently by Byrd et al (2011) is to fix the mini-batch used to define B for the entire run of CG. Mini-batches and the practical issues which arise when using them will be discussed in more depth in Section 12.…”
Section: Algorithm 2 Preconditioned Conjugate Gradient Algorithm (PCG)
Mentioning confidence: 99%
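One way to read the fixed-mini-batch recommendation in the excerpt above, as a hedged sketch: draw the curvature sample once per outer iteration, close over it, and let every Hessian-vector product inside that CG run see the same matrix B. The callables grad_fn, hess_vec_fn, and cg_solve below are hypothetical placeholders for whatever gradient, curvature-product, and CG routines are in use.

```python
import numpy as np

def newton_cg_step(w, X, y, grad_fn, hess_vec_fn, cg_solve,
                   curvature_batch=256, damping=1e-3, seed=None):
    """One outer iteration with the mini-batch defining B frozen for the whole CG run.

    Re-sampling the batch inside CG would change the matrix between CG iterations,
    which is the behaviour the quoted passage reports as making CG diverge.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=min(curvature_batch, len(y)), replace=False)
    Xb, yb = X[idx], y[idx]                      # fixed for the entire CG run

    def matvec(v):                               # B v on the frozen mini-batch
        return hess_vec_fn(w, v, Xb, yb) + damping * v

    g = grad_fn(w, X, y)                         # gradient may use all points or a larger batch
    return w + cg_solve(matvec, -g)
```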
“…Sample selection also plays a crucial role in the incorporation of curvature information in a Hessian-free Newton method for machine learning [7,21]. In this so-called subsampled Hessian Newton-CG method, the step computation is obtained by applying the conjugate gradient (CG) method, which only requires Hessian-vector products and not the Hessian matrix itself.…”
Section: Introduction
Mentioning confidence: 99%
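The "Hessian-free" aspect mentioned in this excerpt can be illustrated with the classical finite-difference trick: a product H_S(w)v costs only two gradient evaluations on the subsample S, and the d-by-d Hessian is never formed. This is a generic sketch rather than the specific product used in the cited work; grad_fn and the choice of eps are assumptions.

```python
import numpy as np

def hess_vec_fd(grad_fn, w, v, X_S, y_S, eps=1e-5):
    """Approximate H_S(w) v by a forward difference of subsample gradients.

    Only gradient evaluations on the subsample S are required; the full
    Hessian matrix is never built, which is what 'Hessian-free' refers to.
    """
    g0 = grad_fn(w, X_S, y_S)
    g1 = grad_fn(w + eps * v, X_S, y_S)
    return (g1 - g0) / eps
```

In practice an exact product via automatic differentiation or, for convex losses, a Gauss-Newton product is usually preferred over finite differences, but the access pattern is the same: CG only ever asks for products of the curvature matrix with a vector.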
“…This paper can be seen as a continuation of the work in [7], which dealt with Hessian sampling techniques for a Newton-CG method. Here we take this work further, first by considering also sample selection techniques for the evaluation of function and gradients, and second, by studying the extension of Hessian sampling techniques to nonsmooth L 1 regularized problems.…”
Section: Introduction
Mentioning confidence: 99%