2016
DOI: 10.1007/978-3-319-46128-1_1

adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs

Abstract: Recurrent Neural Networks (RNNs) are powerful models that achieve exceptional performance on a plethora of pattern recognition problems. However, the training of RNNs is a computationally difficult task owing to the well-known "vanishing/exploding" gradient problem. Algorithms proposed for training RNNs either exploit no (or limited) curvature information and have cheap per-iteration complexity, or attempt to gain significant curvature information at the cost of increased per-iteration cost. The former set includ…

Cited by 20 publications (17 citation statements)
References 15 publications
“…However, when large batches are employed (a regime that is favorable for GPU computing), the multi-batch L-BFGS method performs on par with the other methods. Moreover, it appears that the performance of the multi-batch L-BFGS methods… [footnote 20] The authors in [34] observed that the widely used Barzilai-Borwein-type scaling $\frac{s_k^\top y_k}{y_k^\top y_k} I$ of the initial Hessian approximation may lead to quasi-Newton updates that are not stable when small batch sizes are employed, especially for deep neural network training tasks, and as such propose an Adagrad-like scaling of the initial BFGS matrix. To obviate this instability, we implement a variant of the multi-batch L-BFGS method (LBFGS2) in which we scale the initial Hessian approximations as $\alpha I$.…”
Section: Neural Network (mentioning)
confidence: 99%
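The scaling choices discussed in the statement above enter the standard L-BFGS two-loop recursion through the initial matrix $H_k^0 = \gamma_k I$. The following is a minimal NumPy sketch of that recursion, showing where a Barzilai-Borwein-type $\gamma_k = s_k^\top y_k / y_k^\top y_k$ or a constant $\alpha$ would be plugged in; the function and parameter names are illustrative and not taken from the cited papers.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list, scaling="bb", alpha=1.0):
    """Compute an L-BFGS search direction -H_k * grad via the two-loop recursion.

    s_list, y_list: recent curvature pairs s_i = x_{i+1} - x_i, y_i = g_{i+1} - g_i
                    (oldest first).
    scaling: "bb" uses the Barzilai-Borwein-type initial matrix (s^T y / y^T y) I;
             "const" uses a fixed alpha * I, as in the LBFGS2 variant mentioned above.
    """
    q = grad.copy()
    alphas = []
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]

    # First loop: newest to oldest curvature pair.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * np.dot(s, q)
        alphas.append(a)
        q -= a * y

    # Initial Hessian approximation H_0 = gamma * I.
    if scaling == "bb" and s_list:
        s, y = s_list[-1], y_list[-1]
        gamma = np.dot(s, y) / np.dot(y, y)   # Barzilai-Borwein-type scaling
    else:
        gamma = alpha                          # constant scaling alpha * I
    r = gamma * q

    # Second loop: oldest to newest curvature pair.
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * np.dot(y, r)
        r += s * (a - b)

    return -r  # descent direction
```

With `scaling="const"`, the parameter `alpha` plays the role of the fixed $\alpha$ in the $\alpha I$ scaling described in the quoted statement.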
“…Recently, several stochastic quasi-Newton (SQN) methods have been proposed; see e.g., [5,11,18,20,30,34,47,62,67]. The methods enumerated above differ in three major aspects: (i) the update rules for the curvature (correction) pairs and the Hessian approximation, (ii) the frequency of updating, and (iii) the extra computational cost and synchronization required.…”
Section: Introduction (mentioning)
confidence: 99%
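As a rough illustration of the three aspects listed in this statement, the sketch below maintains a bounded memory of curvature pairs that is refreshed every L iterations, with both gradients evaluated on the same batch. This is a generic pattern under stated assumptions, not the exact mechanism of any of the cited methods; all names are illustrative.

```python
import numpy as np
from collections import deque

def maybe_update_curvature(memory, k, L, w, w_prev, grad_fn, batch, m=10, eps=1e-10):
    """Illustrative curvature-pair maintenance for a stochastic quasi-Newton method.

    Every L iterations, form s = w - w_prev and y = grad(w) - grad(w_prev), with
    both gradients evaluated on the SAME batch so that y reflects curvature rather
    than sampling noise; keep at most m pairs and skip unstable updates.
    """
    if k % L != 0:
        return memory                               # aspect (ii): update frequency
    s = w - w_prev
    y = grad_fn(w, batch) - grad_fn(w_prev, batch)  # aspect (iii): one extra gradient
    if np.dot(s, y) > eps * np.dot(y, y):           # aspect (i): curvature/skip rule
        memory.append((s, y))
        if len(memory) > m:
            memory.popleft()                        # bounded limited-memory history
    return memory
```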
“…However, the high computational cost incurred by second-order methods still poses a major challenge, one that is compounded further in very long sequence modeling problems. Recent studies [9,12,15,16] propose algorithms that judiciously incorporate curvature information while taking the computational cost into consideration.…”
Section: Introduction (mentioning)
confidence: 99%
“…Momentum-based methods, and their combination with second-order methods, have been shown to significantly improve performance and convergence speed [17][18][19]. The proposed method is similar to the framework of SQN [20] and adaQN [16], with some changes that are described in later sections. It combines Nesterov's accelerated quasi-Newton (NAQ) method [19] and the adaQN method [16], thus accelerating convergence while maintaining a low per-iteration cost.…”
Section: Introduction (mentioning)
confidence: 99%
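The combination described here rests on the Nesterov-style idea of evaluating the gradient at a look-ahead point and preconditioning it with a quasi-Newton approximation. Below is a minimal sketch of one such step, assuming a `grad_fn` callable and a `quasi_newton_direction` callable (e.g. the two-loop recursion sketched earlier) that maps a gradient to a descent direction; it illustrates the general NAQ-style pattern, not the exact update of [19] or [16].

```python
import numpy as np

def naq_style_step(w, v, grad_fn, quasi_newton_direction, mu=0.9, lr=0.1):
    """One momentum-accelerated quasi-Newton step in the spirit of NAQ.

    The gradient is evaluated at the Nesterov look-ahead point w + mu*v and then
    mapped to a descent direction by a (limited-memory) quasi-Newton approximation.
    All names and default values here are illustrative, not from the cited papers.
    """
    w_ahead = w + mu * v                    # Nesterov look-ahead point
    g_ahead = grad_fn(w_ahead)              # gradient at the look-ahead point
    d = quasi_newton_direction(g_ahead)     # descent direction, e.g. -H_k * g
    v_new = mu * v + lr * d                 # momentum update along the QN direction
    w_new = w + v_new
    return w_new, v_new
```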