2017 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2017.7965845

A robust adaptive stochastic gradient method for deep learning

Abstract: Stochastic gradient algorithms are the main focus of large-scale optimization problems and have led to important successes in the recent advancement of deep learning algorithms. The convergence of SGD depends on the careful choice of learning rate and the amount of noise in the stochastic estimates of the gradients. In this paper, we propose an adaptive learning rate algorithm, which utilizes stochastic curvature information of the loss function for automatically tuning the learning rates. The informat…
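The abstract is cut off above, but the core idea it describes is a per-parameter learning rate derived from stochastic curvature estimates. As a rough, hypothetical illustration of that idea (not the paper's exact update rule), a secant-style curvature estimate can be computed from successive parameter and gradient differences:

```python
import numpy as np

def secant_learning_rates(theta_prev, theta_curr, grad_prev, grad_curr, eps=1e-8):
    """Per-parameter learning rates from a secant (finite-difference)
    approximation of curvature: eta_i ~ |d theta_i| / |d grad_i|.

    Illustrative sketch only of the general idea in the abstract,
    not the update rule proposed in the paper.
    """
    delta_theta = theta_curr - theta_prev   # change in parameters
    delta_grad = grad_curr - grad_prev      # change in gradients (noisy curvature signal)
    return np.abs(delta_theta) / (np.abs(delta_grad) + eps)
```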

Cited by 38 publications (50 citation statements); references 16 publications. Citing publications span 2018–2023.

Citation statements (ordered by relevance):
“…This paper uses a simplification of the second-order Hessian-Free optimization [Martens, 2010] to estimate optimal values for the learning rate. In addition, we utilize some of the techniques described in Schaul et al. [2013] and Gulcehre et al. [2017]. Also, we show that adaptive learning rate methods such as Nesterov momentum [Sutskever et al., 2013; Nesterov, 1983], AdaGrad [Duchi et al., 2011], AdaDelta [Zeiler, 2012], and Adam [Kingma and Ba, 2014] do not use sufficiently large learning rates when they are effective, nor do they lead to super-convergence.…”
Section: Introduction (mentioning)
confidence: 95%
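The "simplification of second-order Hessian-Free optimization" mentioned in this citation is not spelled out in the report. A minimal sketch of the general Hessian-free recipe, assuming a hypothetical `grad_fn` that returns a stochastic gradient, is to estimate curvature along the gradient direction with a finite-difference Hessian-vector product and take a Newton-like step size:

```python
import numpy as np

def hvp_learning_rate(theta, grad_fn, eps=1e-4, tiny=1e-12):
    """Estimate a learning rate from curvature along the gradient direction.

    The Hessian-vector product H g is approximated by a finite difference of
    gradients, and the step size is the Newton-like ratio (g.g) / (g.Hg).
    Sketch of the Hessian-free idea referenced above, not the cited method.
    """
    g = grad_fn(theta)
    hg = (grad_fn(theta + eps * g) - g) / eps        # finite-difference Hessian-vector product
    curvature = float(np.dot(g, hg))
    return float(np.dot(g, g)) / max(curvature, tiny)
```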
“…where δ should be in the direction of the steepest descent. The AdaSecant method [Gulcehre et al., 2014, 2017] builds an adaptive learning rate method based on this finite difference approximation as:…”
Section: Introduction (mentioning)
confidence: 99%
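The equation this citation refers to is not reproduced in the report. A minimal sketch of a secant-based adaptive step, assuming a hypothetical `grad_fn` and using an exponential moving average in place of the expectations in the actual AdaSecant rule, might look like:

```python
import numpy as np

def adasecant_like_step(theta, grad_fn, state, beta=0.9, eps=1e-8, eta_max=1.0):
    """One step of a crude secant-based adaptive-learning-rate update (sketch only).

    grad_fn(theta) is assumed to return a stochastic gradient of the loss.
    The ratio |d theta| / |d g| is a finite-difference estimate of the inverse
    curvature along each coordinate; a moving average smooths the noisy
    per-parameter step sizes.
    """
    g = grad_fn(theta)
    if "theta_prev" in state:
        d_theta = theta - state["theta_prev"]
        d_grad = g - state["grad_prev"]
        eta = np.abs(d_theta) / (np.abs(d_grad) + eps)           # secant step sizes
        state["eta"] = beta * state["eta"] + (1.0 - beta) * eta  # smooth the estimate
    else:
        state["eta"] = np.full_like(theta, 1e-3)                 # bootstrap step size
    state["theta_prev"], state["grad_prev"] = theta.copy(), g.copy()
    return theta - np.minimum(state["eta"], eta_max) * g
```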
“…The cascaded training procedure has some similarities in spirit to the incremental training procedure of Zhou et al. [31], but that work considered only DAs with one level of noise. The usefulness of varying the level of noise during training of neural nets was also noticed by Gulcehre et al. [15], who add noise to the activation functions. Our training procedure also resembles the walkback training suggested by Bengio et al. [5]; however, we do not require our training loss to be interpretable as negative log-likelihood.…”
Section: Discussion (mentioning)
confidence: 80%
“…The lack of robustness, one of the problems of the EMA scheme, has been dealt with in [19] and [20]. In those methods, the exponential decay parameter of the EMA is increased whenever a value that falls beyond some boundary is encountered.…”
Section: Our Contribution (mentioning)
confidence: 99%
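A minimal sketch of the mechanism described in this citation (increasing the EMA decay when a value falls outside a boundary), with an assumed k-standard-deviation boundary rather than the exact rule of [19] or [20]:

```python
import numpy as np

def robust_ema_update(mean, var, x, decay=0.9, outlier_decay=0.99, k=2.0, eps=1e-8):
    """EMA whose decay parameter is increased for out-of-boundary values.

    A new value farther than k standard deviations from the running mean is
    treated as an outlier and averaged with the larger `outlier_decay`, so a
    single extreme value perturbs the statistics only slightly.
    """
    d = outlier_decay if abs(x - mean) > k * np.sqrt(var + eps) else decay
    new_mean = d * mean + (1.0 - d) * x
    new_var = d * var + (1.0 - d) * (x - new_mean) ** 2
    return new_mean, new_var
```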
“…The main advantage of this approach is that it relies on the natural robustness of the Student-t distribution and its ability to deal with outliers, and it can easily be reduced to the conventional momentum for non-heavy-tailed data. Since the EMA-based first-order momentum is the key of state-of-the-art SGD methods, our t-momentum can be integrated into various methods such as Adam [7], RMSProp [17], VSGD-fd [19], Adasecant [20], or Adabound [18]. Specifically, in this article, we mainly focus on Adam with t-momentum, named t-Adam, to investigate its theoretical performance.…”
Section: Our Contribution (mentioning)
confidence: 99%
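For illustration, a hedged sketch of the t-momentum idea described above: gradients are weighted by a Student-t style factor before entering the first-order moving average. The weight-accumulator decay and the parameter names (`nu`, `beta1`) are assumptions for this sketch, not the exact t-Adam update:

```python
import numpy as np

def t_momentum_update(m, W, g, v, nu=5.0, beta1=0.9, eps=1e-8):
    """Student-t weighted first-order momentum (sketch of the t-momentum idea).

    Gradients that deviate strongly from the current momentum (relative to the
    second-moment estimate v) receive a small weight w, so heavy-tailed
    outliers barely move the momentum; for well-behaved gradients the weight
    saturates and the update behaves like a conventional EMA momentum.
    """
    d = g.size
    dist = np.sum((g - m) ** 2 / (v + eps))      # squared, scale-normalized deviation
    w = (nu + d) / (nu + dist)                   # Student-t style robust weight
    m_new = (W / (W + w)) * m + (w / (W + w)) * g
    W_new = (2.0 * beta1 - 1.0) / beta1 * W + w  # decayed weight accumulator (assumed form)
    return m_new, W_new
```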