2020
DOI: 10.1088/1742-5468/ab633c

Scaling description of generalization with number of parameters in deep learning

Abstract: Supervised deep learning involves the training of neural networks with a large number N of parameters. For large enough N, in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as N grows past a certain threshold N*. Instead, empirical studies have shown that in the over-parametrized regime, the generalization error keeps decreasing with N. We resolve this paradox through a new framework. We re…
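The abstract is cut off before the results. As a hedged sketch of the scaling picture it refers to, consistent with the citation statement further below that ties (i) the cusp to a divergence of the norm of the output function and (ii) the large-N behaviour to its reduced fluctuations, one can write the following; the power-law exponents are assumptions recalled from this line of work, not read from the visible text.

```latex
% Hedged sketch only; the exponents are assumptions, not taken from the truncated abstract.
\begin{align}
  \lVert f_N \rVert &\to \infty \quad \text{as } N \to N^{*}
    && \text{cusp: the norm of the output function diverges at the transition} \\
  \lVert f_N - \bar{f}_\infty \rVert &\sim N^{-1/4}
    && \text{fluctuations of the output function shrink with } N \\
  \epsilon(N) - \epsilon_\infty &\sim N^{-1/2}
    && \text{so the test error keeps decreasing for } N > N^{*}
\end{align}
```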

Cited by 108 publications (114 citation statements)
References 57 publications

“…Beyond the jamming point, the accuracy keeps steadily improving as the number of parameters increases [25,26,27], although it does so quite slowly. We have provided a quantitative explanation for this phenomenon in [33].…”
Section: Generalization At and Beyond Jamming (mentioning)
confidence: 77%
“…Further theoretical studies on regression showed a precise mathematical description of the cusp behaviour in [30,31,32], albeit on models that are practically somewhat further away from modern neural networks. Finally, in [33], our subsequent work, we develop a quantitative theory for (i) the cusp, which is associated with the divergence of the norm of the output function at the critical point of the phase transition, and (ii) the asymptotic behaviour of the generalization error as N → ∞, which is associated with the reduced fluctuations of the output function. That work also shows that after ensemble averaging several networks, performance is optimal near the jamming threshold, emphasizing the practical importance of this transition.…”
Section: Generalization Versus Over-fitting (mentioning)
confidence: 99%
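The ensemble-averaging observation above lends itself to a quick numerical check. The following is a minimal sketch, not the authors' code: it trains several independently initialized networks on the same toy regression task and compares the test error of their averaged output with the mean test error of the individual networks. The task, widths, optimizer settings, and the helper train_one are illustrative assumptions.

```python
# Illustrative sketch (not from the cited paper): ensemble-averaging the outputs of several
# independently trained networks and comparing the ensemble's test error with the mean
# test error of the individual networks.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 1-D regression data: y = sin(3x) + noise (an arbitrary choice for the sketch).
x_train = torch.rand(64, 1) * 2 - 1
y_train = torch.sin(3 * x_train) + 0.1 * torch.randn_like(x_train)
x_test = torch.linspace(-1, 1, 200).unsqueeze(1)
y_test = torch.sin(3 * x_test)

def train_one(width: int, steps: int = 2000) -> nn.Module:
    """Train one two-layer ReLU network with full-batch gradient descent."""
    net = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.05)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(net(x_train), y_train).backward()
        opt.step()
    return net

K, width = 10, 64
nets = [train_one(width) for _ in range(K)]

with torch.no_grad():
    preds = torch.stack([net(x_test) for net in nets])          # shape (K, 200, 1)
    individual_mse = ((preds - y_test) ** 2).mean(dim=(1, 2))   # test MSE of each network
    ensemble_mse = ((preds.mean(dim=0) - y_test) ** 2).mean()   # test MSE of the averaged output

print(f"mean individual test MSE: {individual_mse.mean().item():.4f}")
print(f"ensemble-averaged test MSE: {ensemble_mse.item():.4f}")
```

Averaging suppresses the initialization-dependent fluctuations of the output function, which is the mechanism the quoted statement attributes to the improved performance.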
“…In practice, users stop learning at finite times (which is not needed for the hinge loss, where the dynamics really stops in the overparametrized regime once the loss vanishes). Working at finite time, however, blurs the true critical behavior near jamming, as discussed in [47].…”
Section: Cross-entropy Loss (mentioning)
confidence: 99%
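The parenthetical remark about the hinge loss can be made concrete with a small sketch (illustrative, not from the cited works): once every margin exceeds 1, the hinge loss and its gradient are exactly zero, so gradient descent genuinely halts, whereas the cross-entropy (logistic) loss stays strictly positive and its gradient never vanishes, which is why training must be cut off at a finite time.

```python
# Illustrative sketch: hinge-loss dynamics can stop exactly, cross-entropy cannot.
import torch

margins = torch.tensor([1.5, 2.0, 3.2])            # y_i * f(x_i), all beyond the margin of 1
w = torch.tensor([1.0], requires_grad=True)        # stand-in trainable parameter
scaled = margins * w

hinge = torch.clamp(1 - scaled, min=0).sum()       # hinge loss: sum_i max(0, 1 - y_i f(x_i))
hinge.backward()
print("hinge loss:", hinge.item(), "grad:", w.grad.item())          # both exactly 0.0

w.grad = None
xent = torch.nn.functional.softplus(-scaled).sum() # logistic loss: sum_i log(1 + exp(-y_i f(x_i)))
xent.backward()
print("cross-entropy loss:", xent.item(), "grad:", w.grad.item())   # strictly positive, nonzero grad
```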
“…It would be interesting to investigate whether the learning dynamics, at intermediate times when many data points are not yet fitted, resembles the dynamics near threshold and displays bursts of changes in the constraints. Concerning the latter, we have studied the effect of jamming on generalization since this article was first written, as reported in [46,47].…”
Section: Role Of Depth (mentioning)
confidence: 99%