2012
DOI: 10.1007/s10107-012-0572-5

Sample size selection in optimization methods for machine learning

Abstract: This paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. We propose a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient. We establish an O(1/ε) complexity bound on the total cost of a gradient method. The second part of the …
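The variance-based criterion described in the abstract can be sketched concretely. The snippet below is a minimal illustration, assuming the test has the form ‖Var_{i∈S}(∇f_i(x))‖₁ / |S| ≤ θ² ‖∇f_S(x)‖² and that, when the test fails, the sample is enlarged just enough for the inequality to hold with the current variance estimate; the names (`sample_size_test`, `theta`, `grad_samples`) are illustrative rather than taken from the paper.

```python
import numpy as np

def sample_size_test(grad_samples, theta=0.9):
    """Variance-based check on a sampled (batch) gradient.

    grad_samples: array of shape (|S|, d) holding the per-example
    gradients grad f_i(x) for the current sample S.
    Returns (passed, suggested_size).
    """
    S = grad_samples.shape[0]
    g_S = grad_samples.mean(axis=0)            # sampled gradient
    var = grad_samples.var(axis=0, ddof=1)     # componentwise sample variance
    lhs = var.sum() / S                        # ||Var||_1 / |S|
    rhs = theta**2 * float(g_S @ g_S)          # theta^2 * ||g_S||^2
    if lhs <= rhs:
        return True, S                         # sample deemed accurate enough
    # enlarge the sample so the inequality would hold for the same variance
    return False, int(np.ceil(var.sum() / rhs))
```

The suggested size simply rearranges the inequality for |S| while keeping the current variance estimate fixed.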

Cited by 281 publications (279 citation statements). References 27 publications (43 reference statements).
“…Finally, Byrd et al. (2012) have developed methods to deal with the mini-batch overfitting problem, which are based on heuristics that increase the mini-batch size and also terminate CG early, according to estimates of the variance of the gradient and curvature-matrix vector products. While this is a potentially effective approach (which we don't have experience with), there are several problems with it, in theory.…”
Section: Mini-batch Overfitting and Methods to Combat It (mentioning)
confidence: 99%
“…If $\{p_i\}$ is finite, then this result trivially follows. Otherwise, since $\delta_k \to 0$, we have that for sufficiently large $p_i$, $\delta_{p_i} < b$, with $b$ defined by (5). Since $\|\nabla f(x_{p_i})\| > \epsilon$, and $m_{p_i}$ is fully linear on $B(x_{p_i}, \Delta_{p_i})$, then by the derivations in Theorem 4.2, we have $\|g_{p_i}\| \geq \epsilon/2$, and by Lemma 3.2, $\rho_{p_i} \geq \eta_1$.…”
Section: The Lim-type Convergence (mentioning)
confidence: 96%
“…The resulting methods may be very simple and enjoy low per-iteration complexity, but the practical performance of these approaches can be very poor. On the other hand, it was noted in [5] that the performance of stochastic gradient methods for large-scale machine learning improves substantially if the sample size is increased during the optimization process. Within direct search, the use of random positive spanning sets has also been recently investigated [1,34] with gains in performance and convergence theory for nonsmooth problems.…”
Section: Motivation (mentioning)
confidence: 99%
“…In [4] an adaptive sample size strategy was proposed in the setting where $\nabla f(x) = \sum_{i=1}^{N} \nabla f_i(x)$, for large values of $N$. In this case computing $\nabla f(x)$ accurately can be prohibitive; hence, an estimate $\nabla f_S(x) = \sum_{i \in S} \nabla f_i(x)$ is often computed instead, in the hope that it provides a good estimate of the gradient and a descent direction.…”
Section: Stochastic Gradients and Batch Sampling (mentioning)
confidence: 99%
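To make the subsampled estimate in the last statement concrete, here is a minimal sketch, assuming per-component gradient callables and an optional N/|S| rescaling to keep the estimate unbiased for the full sum; the names (`subsampled_gradient`, `grad_fns`) are illustrative and not from the cited work, which defines ∇f_S simply as the sum over S.

```python
import numpy as np

def subsampled_gradient(grad_fns, x, sample_size, rng=None):
    """Estimate sum_{i=1}^N grad f_i(x) from a random subset S.

    grad_fns: list of N callables, each returning grad f_i(x) as an array.
    The N/|S| rescaling (an assumption here) makes the estimate unbiased
    for the full sum; dropping it gives the plain sum over S used in the
    quoted statement.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(grad_fns)
    idx = rng.choice(N, size=min(sample_size, N), replace=False)
    g_S = sum(grad_fns[i](x) for i in idx)
    return (N / len(idx)) * g_S
```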