2017 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2017.7965845

A robust adaptive stochastic gradient method for deep learning

Abstract: Stochastic gradient algorithms are the main focus of large-scale optimization problems and have led to important successes in the recent advancement of deep learning algorithms. The convergence of SGD depends on the careful choice of learning rate and the amount of noise in the stochastic estimates of the gradients. In this paper, we propose an adaptive learning rate algorithm, which utilizes stochastic curvature information of the loss function for automatically tuning the learning rates. The informat…
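The abstract is cut off above, but the core idea it describes is a per-parameter learning rate derived from stochastic curvature estimates. As a rough, hypothetical illustration of that idea (not the paper's exact update rule), a secant-style curvature estimate can be computed from successive parameter and gradient differences:

```python
import numpy as np

def secant_learning_rates(theta_prev, theta_curr, grad_prev, grad_curr, eps=1e-8):
    """Per-parameter learning rates from a secant (finite-difference)
    approximation of curvature: eta_i ~ |d theta_i| / |d grad_i|.

    Illustrative sketch only of the general idea in the abstract,
    not the update rule proposed in the paper.
    """
    delta_theta = theta_curr - theta_prev   # change in parameters
    delta_grad = grad_curr - grad_prev      # change in gradients (noisy curvature signal)
    return np.abs(delta_theta) / (np.abs(delta_grad) + eps)
```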

Cited by 38 publications (50 citation statements); references 16 publications. Citing publications span 2018–2023.

Citation statements (ordered by relevance):
“…This paper uses a simplification of the second-order Hessian-Free optimization [Martens, 2010] to estimate optimal values for the learning rate. In addition, we utilize some of the techniques described in Schaul et al. [2013] and Gulcehre et al. [2017]. Also, we show that adaptive learning rate methods such as Nesterov momentum [Sutskever et al., 2013; Nesterov, 1983], AdaGrad [Duchi et al., 2011], AdaDelta [Zeiler, 2012], and Adam [Kingma and Ba, 2014] do not use sufficiently large learning rates when they are effective, nor do they lead to super-convergence.…”
Section: Introduction (mentioning)
confidence: 95%
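The "simplification of second-order Hessian-Free optimization" mentioned in this citation is not spelled out in the report. A minimal sketch of the general Hessian-free recipe, assuming a hypothetical `grad_fn` that returns a stochastic gradient, is to estimate curvature along the gradient direction with a finite-difference Hessian-vector product and take a Newton-like step size:

```python
import numpy as np

def hvp_learning_rate(theta, grad_fn, eps=1e-4, tiny=1e-12):
    """Estimate a learning rate from curvature along the gradient direction.

    The Hessian-vector product H g is approximated by a finite difference of
    gradients, and the step size is the Newton-like ratio (g.g) / (g.Hg).
    Sketch of the Hessian-free idea referenced above, not the cited method.
    """
    g = grad_fn(theta)
    hg = (grad_fn(theta + eps * g) - g) / eps        # finite-difference Hessian-vector product
    curvature = float(np.dot(g, hg))
    return float(np.dot(g, g)) / max(curvature, tiny)
```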
“…where δ should be in the direction of the steepest descent. The AdaSecant method [Gulcehre et al., 2014, 2017] builds an adaptive learning rate method based on this finite difference approximation as:…”
Section: Introduction (mentioning)
confidence: 99%
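The equation this citation refers to is not reproduced in the report. A minimal sketch of a secant-based adaptive step, assuming a hypothetical `grad_fn` and using an exponential moving average in place of the expectations in the actual AdaSecant rule, might look like:

```python
import numpy as np

def adasecant_like_step(theta, grad_fn, state, beta=0.9, eps=1e-8, eta_max=1.0):
    """One step of a crude secant-based adaptive-learning-rate update (sketch only).

    grad_fn(theta) is assumed to return a stochastic gradient of the loss.
    The ratio |d theta| / |d g| is a finite-difference estimate of the inverse
    curvature along each coordinate; a moving average smooths the noisy
    per-parameter step sizes.
    """
    g = grad_fn(theta)
    if "theta_prev" in state:
        d_theta = theta - state["theta_prev"]
        d_grad = g - state["grad_prev"]
        eta = np.abs(d_theta) / (np.abs(d_grad) + eps)           # secant step sizes
        state["eta"] = beta * state["eta"] + (1.0 - beta) * eta  # smooth the estimate
    else:
        state["eta"] = np.full_like(theta, 1e-3)                 # bootstrap step size
    state["theta_prev"], state["grad_prev"] = theta.copy(), g.copy()
    return theta - np.minimum(state["eta"], eta_max) * g
```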
“…The cascaded training procedure has some similarities in spirit to the incremental training procedure of Zhou et al. [31], but that work considered only DAs with one level of noise. The usefulness of varying the level of noise during training of neural nets was also noticed by Gulcehre et al. [15], who add noise to the activation functions. Our training procedure also resembles the walkback training suggested by Bengio et al. [5]; however, we do not require our training loss to be interpretable as negative log-likelihood.…”
Section: Discussion (mentioning)
confidence: 80%
“…The lack of robustness, one of the problems of the EMA scheme, has been dealt with in [19] and [20]. In those methods, the exponential decay parameter of the EMA is increased whenever a value that falls beyond some boundary is encountered.…”
Section: Our Contribution (mentioning)
confidence: 99%
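A minimal sketch of the mechanism described in this citation (increasing the EMA decay when a value falls outside a boundary), with an assumed k-standard-deviation boundary rather than the exact rule of [19] or [20]:

```python
import numpy as np

def robust_ema_update(mean, var, x, decay=0.9, outlier_decay=0.99, k=2.0, eps=1e-8):
    """EMA whose decay parameter is increased for out-of-boundary values.

    A new value farther than k standard deviations from the running mean is
    treated as an outlier and averaged with the larger `outlier_decay`, so a
    single extreme value perturbs the statistics only slightly.
    """
    d = outlier_decay if abs(x - mean) > k * np.sqrt(var + eps) else decay
    new_mean = d * mean + (1.0 - d) * x
    new_var = d * var + (1.0 - d) * (x - new_mean) ** 2
    return new_mean, new_var
```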
“…The main advantage of this approach is that it relies on the natural robustness of the Student-t distribution and its ability to deal with outliers, and it can easily be reduced to the conventional momentum for non-heavy-tailed data. Since the EMA-based first-order momentum is the key of state-of-the-art SGD methods, our t-momentum can be integrated into various methods such as Adam [7], RMSProp [17], VSGD-fd [19], Adasecant [20], or Adabound [18]. Specifically, in this article, we mainly focus on Adam with t-momentum, named t-Adam, to investigate its theoretical performance.…”
Section: Our Contribution (mentioning)
confidence: 99%
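For illustration, a hedged sketch of the t-momentum idea described above: gradients are weighted by a Student-t style factor before entering the first-order moving average. The weight-accumulator decay and the parameter names (`nu`, `beta1`) are assumptions for this sketch, not the exact t-Adam update:

```python
import numpy as np

def t_momentum_update(m, W, g, v, nu=5.0, beta1=0.9, eps=1e-8):
    """Student-t weighted first-order momentum (sketch of the t-momentum idea).

    Gradients that deviate strongly from the current momentum (relative to the
    second-moment estimate v) receive a small weight w, so heavy-tailed
    outliers barely move the momentum; for well-behaved gradients the weight
    saturates and the update behaves like a conventional EMA momentum.
    """
    d = g.size
    dist = np.sum((g - m) ** 2 / (v + eps))      # squared, scale-normalized deviation
    w = (nu + d) / (nu + dist)                   # Student-t style robust weight
    m_new = (W / (W + w)) * m + (w / (W + w)) * g
    W_new = (2.0 * beta1 - 1.0) / beta1 * W + w  # decayed weight accumulator (assumed form)
    return m_new, W_new
```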