2018
DOI: 10.48550/arxiv.1810.13243
Preprint

A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation

Abstract: The convergence rate and final performance of common deep learning models have significantly benefited from heuristics such as learning rate schedules, knowledge distillation, skip connections, and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining these strategies can aid our understanding of deep learning landscapes and the training dynamics. Existing approaches for empirical analysis rely on tools of linear interpolation and visualizations with dimensionality reduction…
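The linear-interpolation tool the abstract refers to is commonly implemented by evaluating the loss along the straight line between two parameter vectors (for example, two trained solutions). A minimal sketch under that reading; the flattened weight vectors, `loss_fn`, and the number of interpolation points are illustrative placeholders, not code from the paper:

```python
import numpy as np

def interpolate_loss(theta_a, theta_b, loss_fn, num_points=21):
    """Evaluate loss_fn at evenly spaced points on the segment theta_a -> theta_b."""
    alphas = np.linspace(0.0, 1.0, num_points)
    return [loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas]

# toy usage with a quadratic "loss" on 10-dimensional weight vectors
theta_a, theta_b = np.zeros(10), np.ones(10)
losses = interpolate_loss(theta_a, theta_b, lambda w: float(np.sum(w ** 2)))
```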

Cited by 48 publications (47 citation statements)
References 18 publications (28 reference statements)
“…A recent tool that effectively overcomes these challenges is (Singular Vector) Canonical Correlation Analysis, (SV)CCA [22,19], which has been used to study latent representations through training, across different models, alternate training objectives, and other properties [22,19,25,18,8,17,30].…”
Section: Representational Analysis of the Effects of Transfer
confidence: 99%
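The (SV)CCA tool cited above reduces each layer's activations with an SVD and then compares the reduced representations with canonical correlation analysis. A minimal NumPy sketch under that reading; the function names, the 99% variance threshold, and the toy activations are illustrative assumptions, not code from the cited works:

```python
import numpy as np

def svcca(acts1, acts2, keep_var=0.99):
    """Mean SVCCA similarity between two activation matrices of shape
    (num_neurons, num_datapoints), evaluated on the same datapoints."""
    def svd_reduce(acts):
        acts = acts - acts.mean(axis=1, keepdims=True)        # center each neuron
        U, s, Vt = np.linalg.svd(acts, full_matrices=False)
        k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep_var) + 1
        return np.diag(s[:k]) @ Vt[:k]                         # data in top-k singular coords

    X, Y = svd_reduce(acts1), svd_reduce(acts2)

    def whiten(Z):
        # orthonormal basis of the (centered) row space of Z
        Z = Z - Z.mean(axis=1, keepdims=True)
        return np.linalg.svd(Z, full_matrices=False)[2]

    # canonical correlations = singular values of the whitened cross-product
    rho = np.linalg.svd(whiten(X) @ whiten(Y).T, compute_uv=False)
    return rho.mean()                                          # mean CCA similarity in [0, 1]

# toy usage: two layers' activations on 500 shared inputs
a = np.random.randn(64, 500)
b = 0.5 * a[:32] + 0.1 * np.random.randn(32, 500)              # partially correlated layer
print(svcca(a, b))
```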
“…Then we conduct the meta update on MetaTTE and calculate θ_f (Line 11). Similar to Reptile, we develop the adaptive algorithm as a linear learning rate scheduler [29] formulated as:…”
Section: Meta Learning Based Optimization Algorithm
confidence: 99%
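The excerpt cuts off before the scheduler's formula, so the exact MetaTTE schedule is not reproduced here. A hedged sketch of a generic linearly decayed outer step size applied in a Reptile-style meta update; all function names, parameter names, and default values are assumptions:

```python
def linear_meta_lr(step, total_steps, initial_lr=1.0, final_lr=0.0):
    """Linearly interpolate the outer-loop step size from initial_lr to final_lr."""
    frac = min(step / max(total_steps, 1), 1.0)
    return initial_lr + frac * (final_lr - initial_lr)

def reptile_meta_update(theta, theta_f, step, total_steps):
    """Reptile-style outer update: move the meta-parameters `theta` toward the
    task-adapted parameters `theta_f` by the scheduled step size."""
    eps = linear_meta_lr(step, total_steps)
    return {name: theta[name] + eps * (theta_f[name] - theta[name]) for name in theta}

# toy usage with scalar "parameters"
theta = {"w": 0.0}
theta_f = {"w": 1.0}
theta = reptile_meta_update(theta, theta_f, step=3, total_steps=10)
```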
“…We apply a warm-up schedule on the learning rate for 10% of the total epochs (i.e. for 5 epochs) and reach a maximum learning rate of 0.3, after which we apply cosine decay to the learning rate [13].…”
Section: A Self-supervised Feature Learning
confidence: 99%
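The schedule described in this excerpt, linear warm-up over the first 10% of epochs to a peak learning rate of 0.3 followed by cosine decay, can be sketched as below. The 50-epoch total follows from "i.e. for 5 epochs"; the decay-to-zero floor is an assumption, since the excerpt does not state a final learning rate:

```python
import math

def warmup_cosine_lr(epoch, total_epochs=50, peak_lr=0.3, warmup_frac=0.10):
    """Per-epoch learning rate: linear warm-up, then cosine decay to zero."""
    warmup_epochs = int(warmup_frac * total_epochs)               # 5 epochs here
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs              # linear warm-up
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))   # cosine decay

# e.g. the full schedule: lrs = [warmup_cosine_lr(e) for e in range(50)]
```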