Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems

Nakkiran, Preetum

doi:10.48550/arxiv.2005.07360

Cited by 4 publications

(7 citation statements)

References 8 publications

(9 reference statements)

“…Our result also differs from [49][50][51], which analyze the effect of initial large learning rates.…”

contrasting

confidence: 99%

Robust Recovery via Implicit Bias of Discrepant Learning Rates for Double Over-parameterization

You, Zhu, et al. 2020

Preprint

Recent advances have shown that the implicit bias of gradient descent on overparameterized models enables the recovery of low-rank matrices from linear measurements, even with no prior knowledge of the intrinsic rank. In contrast, for robust low-rank matrix recovery from grossly corrupted measurements, overparameterization leads to overfitting without prior knowledge of both the intrinsic rank and the sparsity of the corruption. This paper shows that with a double overparameterization for both the low-rank matrix and the sparse corruption, gradient descent with discrepant learning rates provably recovers the underlying matrix even without prior knowledge of either the rank of the matrix or the sparsity of the corruption. We further extend our approach to the robust recovery of natural images by over-parameterizing images with deep convolutional networks. Experiments show that our method handles different test images and varying corruption levels with a single learning pipeline where the network width and termination conditions do not need to be adjusted on a case-by-case basis. Underlying the success is again the implicit bias with discrepant learning rates on different over-parameterized parameters, which may bear on broader applications.

“…This is illustrated in Fig. 1, a figure inspired by Nakkiran [2020]. Our second contribution is to show that such a mismatch systematically occurs in simple classification scenarios with low noise, where the quantity of interest to minimize may not be the population risk, as discussed earlier.…”

Section: Summary Of Contributions

mentioning

confidence: 83%

“…Recently, different papers tried to reproduce this phenomenon in convex settings. This is probably thanks to the observation made by Nakkiran [2020], where a toy dataset is exhibited, which was the main motivation for this work. However, it fails to capture realistic scenarios where the data distribution is not isotropic, or with non-linear data embeddings.…”

Section: Related Work

mentioning

confidence: 90%

“…[see, for instance, Smith et al., 2021; Jastrzebski et al., 2021, 2020]; common strategies consist of first using a large learning rate, before annealing it to a smaller value. As a first step towards proving theoretically the effect of choosing large learning rates for training neural networks, Li et al. [2019] devise a two-layer neural network model with different sets of features where the order in which they are learnt matters, and where the previous annealing strategy could be shown to be useful in theory.…”

Section: Related Work

mentioning

confidence: 99%
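The strategy described in the statement above (a large learning rate first, then a smaller one) can be sketched as a two-phase step schedule. This is only an illustrative sketch: the function name, the concrete rates, and the switch point are assumptions chosen for the example, not values taken from any of the cited papers.

```python
def annealed_learning_rate(step, large_lr, small_lr, switch_step):
    """Return large_lr before switch_step and small_lr from switch_step onward.

    All arguments are hypothetical illustration parameters.
    """
    return large_lr if step < switch_step else small_lr


# Hypothetical schedule: three steps at 0.5, then annealed to 0.05.
schedule = [annealed_learning_rate(t, 0.5, 0.05, switch_step=3) for t in range(6)]
print(schedule)
```

Real training pipelines often anneal more gradually (for example, with several drops or a smooth decay); the single switch here is just the simplest instance of the idea.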

“…(1) with plain gradient descent, starting from a vector $\theta_0$ in $H$ with step-size $\eta$, and we distinguish between two cases: having a small learning rate $\eta_s$ or a large learning rate $\eta_b$, the range of both is to be detailed later. A simple intuition was suggested by Nakkiran [2020] on a two-dimensional toy problem, showing that large learning rates may be beneficial as soon as there is a mismatch between $F$ and $R$ (meaning, what we train on does not correspond to what we test on). We show that such an insight can be extended beyond toy problems to realistic scenarios with traditional kernel methods, and that, perhaps surprisingly, this phenomenon occurs already in simple classification tasks.…”

Section: Introduction

mentioning

confidence: 99%

On the Benefits of Large Learning Rates for Kernel Methods

Beugnot, Mairal, Rudi 2022

Preprint

This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. Although this phenomenon was first observed in the deep learning literature, we show that it can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian's eigenvectors. This extends an intuition described by Nakkiran [2020] on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates may be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why this benefit already occurs in classification tasks without assuming any particular mismatch between train and test data distributions.
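The claim that, under early stopping, the learning rate shapes how the solution decomposes along the Hessian's eigenvectors can be illustrated on a small quadratic. The sketch below is not the paper's setup: the two eigenvalues, the target vector, the step count, and the two learning rates are all hypothetical values picked for illustration. It relies only on the standard identity that gradient descent started from zero on a quadratic with Hessian eigenvalue $\lambda_i$ recovers a fraction $1 - (1 - \eta\lambda_i)^t$ of the minimizer along that eigenvector after $t$ steps.

```python
import numpy as np


def gd_quadratic(hessian, theta_star, lr, steps):
    """Plain gradient descent on F(theta) = 0.5 * (theta - theta*)^T H (theta - theta*),
    started from theta = 0 and stopped after `steps` iterations (early stopping)."""
    theta = np.zeros_like(theta_star)
    for _ in range(steps):
        theta = theta - lr * hessian @ (theta - theta_star)
    return theta


# Hypothetical 2-D problem with one large and one small Hessian eigenvalue.
eigvals = np.array([1.0, 0.01])
hessian = np.diag(eigvals)
theta_star = np.array([1.0, 1.0])
steps = 50

# Hypothetical learning rates; both satisfy lr < 2 / max(eigvals), so GD converges.
learning_rates = {"small": 0.1, "large": 1.9}

fractions = {}
for name, lr in learning_rates.items():
    theta = gd_quadratic(hessian, theta_star, lr, steps)
    # Closed form: fraction of theta* recovered along each eigenvector.
    closed_form = (1.0 - (1.0 - lr * eigvals) ** steps) * theta_star
    assert np.allclose(theta, closed_form)
    fractions[name] = theta / theta_star
    print(name, fractions[name])
```

In this toy instance both learning rates fit the large-eigenvalue direction equally well after the same number of steps, because $|1 - \eta\lambda|$ is the same for both, while the large learning rate recovers much more of the small-eigenvalue direction. Which of the two early-stopped solutions generalizes better depends on the test objective, which this sketch does not model.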

A Zeroth-Order Adaptive Learning Rate Method to Reduce Cost of Hyperparameter Tuning for Deep Learning

Ren, Zhao, et al. 2021

Applied Sciences

Due to its powerful data representation ability, deep learning has dramatically improved the state of the art in many practical applications. However, its utility depends heavily on the fine-tuning of hyper-parameters, including learning rate, batch size, and network initialization. Although many first-order adaptive methods (e.g., Adam, Adagrad) have been proposed to adjust the learning rate based on gradients, they are sensitive to the initial learning rate and network architecture. Therefore, the main challenge of using deep learning in practice is how to reduce the cost of tuning hyper-parameters. To address this, we propose a heuristic zeroth-order learning rate method, Adacomp, which adaptively adjusts the learning rate based only on values of the loss function. The main idea is that Adacomp penalizes large learning rates to ensure convergence and compensates small learning rates to accelerate the training process. Therefore, Adacomp is robust to the initial learning rate. Extensive experiments, including comparison to six typical adaptive methods (Momentum, Adagrad, RMSprop, Adadelta, Adam, and Adamax) on several benchmark datasets for image classification tasks (MNIST, KMNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100), were conducted. Experimental results show that Adacomp is robust not only to the initial learning rate but also to the network architecture, network initialization, and batch size.
