2020
DOI: 10.48550/arxiv.2007.10099
Preprint

Early Stopping in Deep Networks: Double Descent and How to Eliminate it

Abstract: Over-parameterized models, in particular deep networks, often exhibit a double descent phenomenon, where, as a function of model size, the error first decreases, then increases, and finally decreases again. This intriguing double descent behavior also occurs as a function of training epochs and has been conjectured to arise because training epochs control the model complexity. In this paper, we show that such epoch-wise double descent arises for a different reason: It is caused by a superposition of two or more bias-variance…
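To make the superposition claim concrete, the following is a minimal numerical sketch, not code from the paper: two idealized bias-variance tradeoffs with very different timescales are added together, and their sum first decreases, then increases, then decreases again as a function of training epochs. All curve shapes and constants are hypothetical choices for illustration.

    import numpy as np

    # Minimal sketch (not from the paper): epoch-wise double descent modeled as
    # the superposition of two idealized bias-variance tradeoffs with different
    # timescales, e.g. one part of the model fitted quickly and another slowly.
    # All constants below are illustrative, not values from the paper.
    epochs = np.arange(1, 1001, dtype=float)

    # Fast component: bias decays quickly, variance (overfitting) rises early.
    risk_fast = np.exp(-epochs / 8.0) + 1.0 * (1.0 - np.exp(-epochs / 40.0))

    # Slow component: bias decays over hundreds of epochs; its variance rise
    # is negligible on this horizon.
    risk_slow = 1.2 * np.exp(-epochs / 400.0)

    total = risk_fast + risk_slow

    # The sum first decreases, then increases, then decreases again.
    for t in (1, 20, 80, 1000):
        print(f"epoch {t:4d}: test risk ~ {total[t - 1]:.2f}")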

Cited by 10 publications (13 citation statements)
References 9 publications

“…Near critical parameterization, this gives rise to an epoch-wise double descent for noise levels above a threshold. This is similar to other linear models (recently [11,21]) that achieve the well-studied double descent in model complexity by assuming the existence of small-scale (small-eigenvalue) features which couple to uniform noise. While not a focus of this work, we note that both effects can coexist if the small-scale features are weakly affected by noise in comparison to the large-scale features.…”
Section: Discussion (supporting)
confidence: 80%
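For context, the model-wise double descent referenced in this statement can be reproduced with a generic ridgeless random-features regression. The sketch below is a standard illustration of that phenomenon, not the specific linear models of [11,21]; the sample size, feature counts, and noise level are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Generic illustration (hypothetical sizes and noise level): n training
    # points, a noisy linear teacher in d dimensions, and ridgeless
    # (minimum-norm) least squares on p random tanh features. Test error
    # typically peaks near the interpolation threshold p ~ n and descends
    # again for larger p -- the model-wise double descent in model complexity.
    n, n_test, d, noise = 40, 2000, 20, 0.5
    w_star = rng.normal(size=d) / np.sqrt(d)

    def make_data(m):
        X = rng.normal(size=(m, d))
        return X, X @ w_star + noise * rng.normal(size=m)

    X_train, y_train = make_data(n)
    X_test, y_test = make_data(n_test)

    for p in (5, 20, 38, 40, 42, 80, 400):
        W = rng.normal(size=(d, p)) / np.sqrt(d)    # shared random feature map
        phi_train, phi_test = np.tanh(X_train @ W), np.tanh(X_test @ W)
        beta = np.linalg.pinv(phi_train) @ y_train  # min-norm least squares
        mse = np.mean((phi_test @ beta - y_test) ** 2)
        print(f"p = {p:3d}   test MSE ~ {mse:.2f}")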
“…This effect can be removed by taking an appropriate learning rate for each feature, which lines up the different bias-variance tradeoffs in time, an effect which can sometimes improve generalization. Additionally, [21] shows analytically that similar effects occur in a two-layer network, and experimentally demonstrates similar behavior in a 5-layer convolutional network.…”
Section: Related Work (mentioning)
confidence: 63%
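A minimal sketch of the "appropriate learning rate for each feature" idea, assuming a hypothetical two-scale linear regression rather than the models analyzed in the cited works: with a uniform step size, gradient descent fits the large-scale feature block far earlier than the small-scale one, while per-feature step sizes line the two fitting timescales up. All dimensions, scales, and step sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical two-scale linear regression: block A has feature scale 1,
    # block B has feature scale 0.1, so plain gradient descent fits block A
    # roughly 100x faster than block B (the squared-scale ratio). Rescaling
    # the step size per feature aligns the two fitting timescales.
    n, dA, dB = 500, 10, 10
    scales = np.concatenate([np.ones(dA), 0.1 * np.ones(dB)])
    X = rng.normal(size=(n, dA + dB)) * scales
    w_star = rng.normal(size=dA + dB)
    y = X @ w_star + 0.1 * rng.normal(size=n)

    def run_gd(per_feature_lr, steps=3000, eta=0.05):
        """Return the step at which each block's error has halved."""
        w = np.zeros(dA + dB)
        # Per-feature step sizes ~ 1 / feature second moment (preconditioning).
        lr = eta / scales**2 if per_feature_lr else eta * np.ones_like(scales)
        half_step = {"A": None, "B": None}
        err0 = {"A": np.linalg.norm(w_star[:dA]), "B": np.linalg.norm(w_star[dA:])}
        for t in range(1, steps + 1):
            grad = X.T @ (X @ w - y) / n
            w -= lr * grad
            for name, sl in (("A", slice(0, dA)), ("B", slice(dA, None))):
                if half_step[name] is None and \
                   np.linalg.norm(w[sl] - w_star[sl]) < 0.5 * err0[name]:
                    half_step[name] = t
        return half_step

    print("uniform step size:    ", run_gd(per_feature_lr=False))
    print("per-feature step size:", run_gd(per_feature_lr=True))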
“…This drawback is therefore unlikely to cause a problem for gradient disparity when it is used as an early stopping criterion. Nevertheless, as a future direction, it would be interesting to explore this further, especially for scenarios such as epoch-wise double descent [20].…”
Section: Discussion and Final Remarks (mentioning)
confidence: 99%
“…(2) For residual DNNs with skip-connections and a few traditional DNNs (like the VGG-16 trained on the Pascal VOC dataset; see Appendix G.4), the complexity increased monotonically during the early stage of the training process and saturated later. This indicated that noisy features had little effect on DNNs with skip-connections in early stages of the learning process, which implied the temporal double-descent phenomenon [18,38].…”
Section: Comparative Studies To Diagnose the Representation Capacity ... (mentioning)
confidence: 99%