2020
DOI: 10.48550/arxiv.2007.10099
Preprint

Early Stopping in Deep Networks: Double Descent and How to Eliminate it

Abstract: Over-parameterized models, in particular deep networks, often exhibit a double descent phenomenon, where, as a function of model size, the error first decreases, then increases, and finally decreases again. This intriguing double descent behavior also occurs as a function of training epochs and has been conjectured to arise because training epochs control the model complexity. In this paper, we show that such epoch-wise double descent arises for a different reason: It is caused by a superposition of two or more bias-variance…
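To make the superposition claim concrete, the following is a minimal numerical sketch, not code from the paper: two idealized bias-variance tradeoffs with very different timescales are added together, and their sum first decreases, then increases, then decreases again as a function of training epochs. All curve shapes and constants are hypothetical choices for illustration.

    import numpy as np

    # Minimal sketch (not from the paper): epoch-wise double descent modeled as
    # the superposition of two idealized bias-variance tradeoffs with different
    # timescales, e.g. one part of the model fitted quickly and another slowly.
    # All constants below are illustrative, not values from the paper.
    epochs = np.arange(1, 1001, dtype=float)

    # Fast component: bias decays quickly, variance (overfitting) rises early.
    risk_fast = np.exp(-epochs / 8.0) + 1.0 * (1.0 - np.exp(-epochs / 40.0))

    # Slow component: bias decays over hundreds of epochs; its variance rise
    # is negligible on this horizon.
    risk_slow = 1.2 * np.exp(-epochs / 400.0)

    total = risk_fast + risk_slow

    # The sum first decreases, then increases, then decreases again.
    for t in (1, 20, 80, 1000):
        print(f"epoch {t:4d}: test risk ~ {total[t - 1]:.2f}")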

Cited by 10 publications (13 citation statements)
References 9 publications

“…Near critical parameterization, this gives rise to an epoch-wise double descent for noise levels above a threshold. This is similar to other linear models (recently [11,21]) that achieve the well-studied double descent in model complexity by assuming the existence of small-scale (small-eigenvalue) features which couple to uniform noise. While not a focus of this work, we note that both effects can coexist if the small-scale features are weakly affected by noise in comparison to the large-scale features.…”
Section: Discussion (supporting)
confidence: 80%
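For context, the model-wise double descent referenced in this statement can be reproduced with a generic ridgeless random-features regression. The sketch below is a standard illustration of that phenomenon, not the specific linear models of [11,21]; the sample size, feature counts, and noise level are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Generic illustration (hypothetical sizes and noise level): n training
    # points, a noisy linear teacher in d dimensions, and ridgeless
    # (minimum-norm) least squares on p random tanh features. Test error
    # typically peaks near the interpolation threshold p ~ n and descends
    # again for larger p -- the model-wise double descent in model complexity.
    n, n_test, d, noise = 40, 2000, 20, 0.5
    w_star = rng.normal(size=d) / np.sqrt(d)

    def make_data(m):
        X = rng.normal(size=(m, d))
        return X, X @ w_star + noise * rng.normal(size=m)

    X_train, y_train = make_data(n)
    X_test, y_test = make_data(n_test)

    for p in (5, 20, 38, 40, 42, 80, 400):
        W = rng.normal(size=(d, p)) / np.sqrt(d)    # shared random feature map
        phi_train, phi_test = np.tanh(X_train @ W), np.tanh(X_test @ W)
        beta = np.linalg.pinv(phi_train) @ y_train  # min-norm least squares
        mse = np.mean((phi_test @ beta - y_test) ** 2)
        print(f"p = {p:3d}   test MSE ~ {mse:.2f}")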
“…This effect can be removed by taking an appropriate learning rate for each feature, which lines up the different bias-variance tradeoffs in time, an effect which can sometimes improve generalization. Additionally, [21] shows analytically that similar effects occur in a two-layer network, and experimentally demonstrates similar behavior in a 5-layer convolutional network.…”
Section: Related Work (mentioning)
confidence: 63%
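A minimal sketch of the "appropriate learning rate for each feature" idea, assuming a hypothetical two-scale linear regression rather than the models analyzed in the cited works: with a uniform step size, gradient descent fits the large-scale feature block far earlier than the small-scale one, while per-feature step sizes line the two fitting timescales up. All dimensions, scales, and step sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical two-scale linear regression: block A has feature scale 1,
    # block B has feature scale 0.1, so plain gradient descent fits block A
    # roughly 100x faster than block B (the squared-scale ratio). Rescaling
    # the step size per feature aligns the two fitting timescales.
    n, dA, dB = 500, 10, 10
    scales = np.concatenate([np.ones(dA), 0.1 * np.ones(dB)])
    X = rng.normal(size=(n, dA + dB)) * scales
    w_star = rng.normal(size=dA + dB)
    y = X @ w_star + 0.1 * rng.normal(size=n)

    def run_gd(per_feature_lr, steps=3000, eta=0.05):
        """Return the step at which each block's error has halved."""
        w = np.zeros(dA + dB)
        # Per-feature step sizes ~ 1 / feature second moment (preconditioning).
        lr = eta / scales**2 if per_feature_lr else eta * np.ones_like(scales)
        half_step = {"A": None, "B": None}
        err0 = {"A": np.linalg.norm(w_star[:dA]), "B": np.linalg.norm(w_star[dA:])}
        for t in range(1, steps + 1):
            grad = X.T @ (X @ w - y) / n
            w -= lr * grad
            for name, sl in (("A", slice(0, dA)), ("B", slice(dA, None))):
                if half_step[name] is None and \
                   np.linalg.norm(w[sl] - w_star[sl]) < 0.5 * err0[name]:
                    half_step[name] = t
        return half_step

    print("uniform step size:    ", run_gd(per_feature_lr=False))
    print("per-feature step size:", run_gd(per_feature_lr=True))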
“…This drawback is therefore unlikely to cause a problem for gradient disparity when it is used as an early stopping criterion. Nevertheless, as a future direction, it would be interesting to explore this further, especially for scenarios such as epoch-wise double descent [20].…”
Section: Discussion and Final Remarks (mentioning)
confidence: 99%
“…(2) For residual DNNs with skip-connections and a few traditional DNNs (like the VGG-16 trained on the Pascal VOC dataset; see Appendix G.4), the complexity increased monotonically during the early stage of the training process and saturated later. This indicated that noisy features had little effect on DNNs with skip-connections in early stages of the learning process, which implied the temporal double-descent phenomenon [18,38].…”
Section: Comparative Studies To Diagnose the Representation Capacity ... (mentioning)
confidence: 99%