2011
DOI: 10.1137/10079923x

On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning

Abstract: This paper describes how to incorporate sampled curvature information in a Newton-CG method and in a limited memory quasi-Newton method for statistical learning. The motivation for this work stems from supervised machine learning applications involving a very large number of training points. We follow a batch approach, also known in the stochastic optimization literature as a sample average approximation (SAA) approach. Curvature information is incorporated in two sub-sampled Hessian algorithms, one based on a…
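The sub-sampling idea in the abstract can be illustrated with a minimal sketch (not the authors' code): for a generic binary logistic-regression loss, chosen here only for concreteness, the gradient uses the full training set while the Hessian-vector products needed by CG use only a small random subsample. All function names, the damping constant, the fixed step length, and the CG iteration limit below are illustrative assumptions.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss, labels y in {-1, +1}."""
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))        # sigma(-y * x^T w)
    return -(X.T @ (y * s)) / len(y)

def subsampled_hess_vec(w, v, X, idx, damping=1e-3):
    """Hessian-vector product estimated on the subsample `idx` only."""
    Xs = X[idx]
    p = 1.0 / (1.0 + np.exp(-(Xs @ w)))
    d = p * (1.0 - p)                            # per-example curvature weights
    return Xs.T @ (d * (Xs @ v)) / len(idx) + damping * v

def cg(matvec, b, max_iter=10, tol=1e-8):
    """Plain conjugate gradient: needs only matrix-vector products, never the matrix."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def subsampled_newton_cg(X, y, hess_sample=64, outer_iters=20, step=1.0, seed=0):
    """Outer loop: full-batch gradient, sub-sampled curvature, CG for the step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(outer_iters):
        g = logistic_grad(w, X, y)               # gradient: all training points
        idx = rng.choice(len(y), size=min(hess_sample, len(y)), replace=False)
        d = cg(lambda v: subsampled_hess_vec(w, v, X, idx), -g)
        w = w + step * d                         # a line search would go here
    return w

# Toy usage on synthetic data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20))
y = np.sign(X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000))
w_hat = subsampled_newton_cg(X, y)
```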

Cited by 192 publications (178 citation statements)
References 9 publications
“…Martens (2010) recommended using this technique (as does Byrd et al (2011)), and in our experience it can often improve optimization speed if used carefully and in the right contexts. But despite this, there are several good theoretical reasons why it might be better, at least in some situations, to use the same mini-batch to compute gradient and curvature matrix-vector products.…”
Section: Higher Quality Gradient Estimates
Mentioning confidence: 97%
“…And in practice, we have found that such an approach does not seem to work very well, and results in CG itself diverging in some cases. The solution advocated by Martens (2010) and independently by Byrd et al (2011) is to fix the mini-batch used to define B for the entire run of CG. Mini-batches and the practical issues which arise when using them will be discussed in more depth in Section 12.…”
Section: Algorithm 2 Preconditioned Conjugate Gradient Algorithm (PCG)
Mentioning confidence: 99%
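One way to read the fixed-mini-batch recommendation in the excerpt above, as a hedged sketch: draw the curvature sample once per outer iteration, close over it, and let every Hessian-vector product inside that CG run see the same matrix B. The callables grad_fn, hess_vec_fn, and cg_solve below are hypothetical placeholders for whatever gradient, curvature-product, and CG routines are in use.

```python
import numpy as np

def newton_cg_step(w, X, y, grad_fn, hess_vec_fn, cg_solve,
                   curvature_batch=256, damping=1e-3, seed=None):
    """One outer iteration with the mini-batch defining B frozen for the whole CG run.

    Re-sampling the batch inside CG would change the matrix between CG iterations,
    which is the behaviour the quoted passage reports as making CG diverge.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=min(curvature_batch, len(y)), replace=False)
    Xb, yb = X[idx], y[idx]                      # fixed for the entire CG run

    def matvec(v):                               # B v on the frozen mini-batch
        return hess_vec_fn(w, v, Xb, yb) + damping * v

    g = grad_fn(w, X, y)                         # gradient may use all points or a larger batch
    return w + cg_solve(matvec, -g)
```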
“…Sample selection also plays a crucial role in the incorporation of curvature information in a Hessian-free Newton method for machine learning [7,21]. In this so-called subsampled Hessian Newton-CG method, the step computation is obtained by applying the conjugate gradient (CG) method, which only requires Hessian-vector products and not the Hessian matrix itself.…”
Section: Introduction
Mentioning confidence: 99%
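The "Hessian-free" aspect mentioned in this excerpt can be illustrated with the classical finite-difference trick: a product H_S(w)v costs only two gradient evaluations on the subsample S, and the d-by-d Hessian is never formed. This is a generic sketch rather than the specific product used in the cited work; grad_fn and the choice of eps are assumptions.

```python
import numpy as np

def hess_vec_fd(grad_fn, w, v, X_S, y_S, eps=1e-5):
    """Approximate H_S(w) v by a forward difference of subsample gradients.

    Only gradient evaluations on the subsample S are required; the full
    Hessian matrix is never built, which is what 'Hessian-free' refers to.
    """
    g0 = grad_fn(w, X_S, y_S)
    g1 = grad_fn(w + eps * v, X_S, y_S)
    return (g1 - g0) / eps
```

In practice an exact product via automatic differentiation or, for convex losses, a Gauss-Newton product is usually preferred over finite differences, but the access pattern is the same: CG only ever asks for products of the curvature matrix with a vector.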
“…This paper can be seen as a continuation of the work in [7], which dealt with Hessian sampling techniques for a Newton-CG method. Here we take this work further, first by considering also sample selection techniques for the evaluation of function and gradients, and second, by studying the extension of Hessian sampling techniques to nonsmooth L 1 regularized problems.…”
Section: Introduction
Mentioning confidence: 99%