2016
DOI: 10.1007/978-3-319-46128-1_1

adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs

Abstract: Recurrent Neural Networks (RNNs) are powerful models that achieve exceptional performance on a plethora of pattern recognition problems. However, the training of RNNs is a computationally difficult task owing to the well-known "vanishing/exploding" gradient problem. Algorithms proposed for training RNNs either exploit no (or limited) curvature information and have cheap per-iteration complexity, or attempt to gain significant curvature information at the cost of increased per-iteration cost. The former set includ…

Cited by 20 publications (17 citation statements)
References 15 publications
“…However, when large batches are employed (a regime that is favorable for GPU computing), the multi-batch L-BFGS method performs on par with the other methods. Moreover, it appears that the performance of the multi-batch L-BFGS methods… [footnote 20] The authors in [34] observed that the widely used Barzilai-Borwein-type scaling $\frac{s_k^\top y_k}{y_k^\top y_k} I$ of the initial Hessian approximation may lead to quasi-Newton updates that are not stable when small batch sizes are employed, especially for deep neural network training tasks, and as such propose an Adagrad-like scaling of the initial BFGS matrix. To obviate this instability, we implement a variant of the multi-batch L-BFGS method (LBFGS2) in which we scale the initial Hessian approximations as $\alpha I$.…”
Section: Neural Network (mentioning)
confidence: 99%
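The scaling choices discussed in the statement above enter the standard L-BFGS two-loop recursion through the initial matrix $H_k^0 = \gamma_k I$. The following is a minimal NumPy sketch of that recursion, showing where a Barzilai-Borwein-type $\gamma_k = s_k^\top y_k / y_k^\top y_k$ or a constant $\alpha$ would be plugged in; the function and parameter names are illustrative and not taken from the cited papers.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list, scaling="bb", alpha=1.0):
    """Compute an L-BFGS search direction -H_k * grad via the two-loop recursion.

    s_list, y_list: recent curvature pairs s_i = x_{i+1} - x_i, y_i = g_{i+1} - g_i
                    (oldest first).
    scaling: "bb" uses the Barzilai-Borwein-type initial matrix (s^T y / y^T y) I;
             "const" uses a fixed alpha * I, as in the LBFGS2 variant mentioned above.
    """
    q = grad.copy()
    alphas = []
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]

    # First loop: newest to oldest curvature pair.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * np.dot(s, q)
        alphas.append(a)
        q -= a * y

    # Initial Hessian approximation H_0 = gamma * I.
    if scaling == "bb" and s_list:
        s, y = s_list[-1], y_list[-1]
        gamma = np.dot(s, y) / np.dot(y, y)   # Barzilai-Borwein-type scaling
    else:
        gamma = alpha                          # constant scaling alpha * I
    r = gamma * q

    # Second loop: oldest to newest curvature pair.
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * np.dot(y, r)
        r += s * (a - b)

    return -r  # descent direction
```

With `scaling="const"`, the parameter `alpha` plays the role of the fixed $\alpha$ in the $\alpha I$ scaling described in the quoted statement.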
“…Recently, several stochastic quasi-Newton (SQN) methods have been proposed; see e.g., [5,11,18,20,30,34,47,62,67]. The methods enumerated above differ in three major aspects: (i) the update rules for the curvature (correction) pairs and the Hessian approximation, (ii) the frequency of updating, and (iii) the extra computational cost and synchronization required.…”
Section: Introduction (mentioning)
confidence: 99%
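As a rough illustration of the three aspects listed in this statement, the sketch below maintains a bounded memory of curvature pairs that is refreshed every L iterations, with both gradients evaluated on the same batch. This is a generic pattern under stated assumptions, not the exact mechanism of any of the cited methods; all names are illustrative.

```python
import numpy as np
from collections import deque

def maybe_update_curvature(memory, k, L, w, w_prev, grad_fn, batch, m=10, eps=1e-10):
    """Illustrative curvature-pair maintenance for a stochastic quasi-Newton method.

    Every L iterations, form s = w - w_prev and y = grad(w) - grad(w_prev), with
    both gradients evaluated on the SAME batch so that y reflects curvature rather
    than sampling noise; keep at most m pairs and skip unstable updates.
    """
    if k % L != 0:
        return memory                               # aspect (ii): update frequency
    s = w - w_prev
    y = grad_fn(w, batch) - grad_fn(w_prev, batch)  # aspect (iii): one extra gradient
    if np.dot(s, y) > eps * np.dot(y, y):           # aspect (i): curvature/skip rule
        memory.append((s, y))
        if len(memory) > m:
            memory.popleft()                        # bounded limited-memory history
    return memory
```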
“…However, the high computational cost incurred by second-order methods still poses a major challenge, one that is compounded further in very long sequence modeling problems. Recent studies [9,12,15,16] propose algorithms that judiciously incorporate curvature information while taking the computational cost into consideration.…”
Section: Introduction (mentioning)
confidence: 99%
“…Momentum-based methods, and their combination with second-order methods, have been shown to significantly improve performance and convergence speed [17][18][19]. The proposed method is similar to the framework of SQN [20] and adaQN [16], with some changes that are described in later sections. It combines Nesterov's accelerated quasi-Newton (NAQ) method [19] and the adaQN method [16], thus accelerating convergence while maintaining a low per-iteration cost.…”
Section: Introduction (mentioning)
confidence: 99%
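The combination described here rests on the Nesterov-style idea of evaluating the gradient at a look-ahead point and preconditioning it with a quasi-Newton approximation. Below is a minimal sketch of one such step, assuming a `grad_fn` callable and a `quasi_newton_direction` callable (e.g. the two-loop recursion sketched earlier) that maps a gradient to a descent direction; it illustrates the general NAQ-style pattern, not the exact update of [19] or [16].

```python
import numpy as np

def naq_style_step(w, v, grad_fn, quasi_newton_direction, mu=0.9, lr=0.1):
    """One momentum-accelerated quasi-Newton step in the spirit of NAQ.

    The gradient is evaluated at the Nesterov look-ahead point w + mu*v and then
    mapped to a descent direction by a (limited-memory) quasi-Newton approximation.
    All names and default values here are illustrative, not from the cited papers.
    """
    w_ahead = w + mu * v                    # Nesterov look-ahead point
    g_ahead = grad_fn(w_ahead)              # gradient at the look-ahead point
    d = quasi_newton_direction(g_ahead)     # descent direction, e.g. -H_k * g
    v_new = mu * v + lr * d                 # momentum update along the QN direction
    w_new = w + v_new
    return w_new, v_new
```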