2020
DOI: 10.48550/arxiv.2005.07360
Preprint

Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems

Preetum Nakkiran

Abstract: Learning rate schedule can significantly affect generalization performance in modern neural networks, but the reasons for this are not yet understood. Li et al. (2019) recently proved this behavior can exist in a simplified non-convex neural-network setting. In this note, we show that this phenomenon can exist even for convex learning problems, in particular, linear regression in 2 dimensions. We give a toy convex problem where learning rate annealing (large initial learning rate, followed by small learning rat…

Cited by 4 publications (7 citation statements)
References 8 publications (9 reference statements)
“…Our result also differs from [49][50][51] which analyze the effect of initial large learning rates 5. Here, H, W are height and width of the image, respectively.…”
contrasting
confidence: 99%
“…This is illustrated in Fig. 1, a figure inspired by Nakkiran [2020]. Our second contribution is to show that such a mismatch systematically occurs in simple classification scenarios with low noise, where the quantity of interest to minimize may not be the population risk, as discussed earlier.…”
Section: Summary Of Contributions
mentioning
confidence: 83%
“…Recently, different papers tried to reproduce this phenomenon in convex settings. This is probably thanks to the observation made by Nakkiran [2020], where a toy dataset is exhibited, which was the main motivation for this work. However, it fails to capture realistic scenarios where the data distribution is not isotropic, or with non linear data embeddings.…”
Section: Related Work
mentioning
confidence: 90%