2020
DOI: 10.1609/aaai.v34i04.5736

Generalization Error Bounds of Gradient Descent for Learning Over-Parameterized Deep ReLU Networks

Abstract: Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. Very recently, a line of work explains in theory that with over-parameterization and proper random initialization, gradient-based methods can find the global minima of the training loss for DNNs. However, existing generalization error bounds are unable to explain the good generalization…
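As a concrete illustration of the phenomenon the abstract refers to, the sketch below trains a very wide deep ReLU network with plain (full-batch) gradient descent on randomly labeled data; with enough width the training loss can be driven close to zero. This is only a minimal illustrative sketch, not the paper's construction or proof setting; the architecture, widths, learning rate, and sample size are assumptions chosen for the demo.

```python
# Minimal sketch (not the paper's exact setting): plain full-batch gradient
# descent on an over-parameterized deep ReLU network fitting random labels.
# Widths, learning rate, and sample size are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d, width = 64, 10, 2048                         # few samples, very wide layers
X = torch.randn(n, d)
y = torch.randint(0, 2, (n, 1)).float() * 2 - 1    # random +/-1 labels

model = nn.Sequential(
    nn.Linear(d, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, 1),
)

opt = torch.optim.SGD(model.parameters(), lr=0.05)  # full batch => plain GD
loss_fn = nn.MSELoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  training loss {loss.item():.4f}")
# With enough width and enough steps the training loss is driven near zero,
# i.e. the network fits even random labels.
```

Evaluating the same model on a held-out set with fresh random labels would show the gap between training and generalization performance that the paper's bounds aim to explain.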

Cited by 142 publications (219 citation statements)
References 6 publications

“…need not be convex (even when (·) is). It has been argued in several recent papers that in highly overparameterized neural networks, because W is very high dimensional, any random initialization w_0 is close to it, with high probability [20], [22]-[25] (see also the discussion in Appendix A in the Supplementary Material). In such settings, it is reasonable to make the following assumption about the manifold.…”
Section: B. Main Results (mentioning, confidence: 99%)
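The geometric claim in this excerpt, that a random initialization in a very high-dimensional parameter space lies close to the set of solutions with high probability, can be illustrated with a much simpler linear analogy. This is a hedged sketch of the intuition only, not the argument of [20], [22]-[25]: the distance from a random point w_0 in R^p to the affine solution set {w : Aw = b} of an underdetermined system with n constraints shrinks relative to ||w_0|| as p grows with n fixed. All sizes below are illustrative.

```python
# Hedged sketch of the geometric intuition only (a linear analogy, not the
# neural-network argument of the cited papers). For A w = b with n constraints
# in R^p, the solution set is an affine subspace of dimension p - n, and the
# distance from a random w0 to it is
#   dist = || A^+ (A w0 - b) ||,   where A^+ is the pseudo-inverse.
# Relative to ||w0||, this distance shrinks as p grows with n fixed.
import numpy as np

rng = np.random.default_rng(0)
n = 50                                 # number of constraints ("training points")
for p in (100, 1_000, 10_000, 100_000):
    A = rng.standard_normal((n, p)) / np.sqrt(p)
    b = rng.standard_normal(n)
    w0 = rng.standard_normal(p)        # random initialization
    residual = A @ w0 - b
    dist = np.linalg.norm(np.linalg.pinv(A) @ residual)
    print(f"p = {p:7d}   dist / ||w0|| = {dist / np.linalg.norm(w0):.4f}")
```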
“…Only the threshold of the output node is fuzzified according to Theorems 1-3. The values of the other parameters are derived by training the FDNN as a crisp deep neural network using the commonly used GD or LM algorithm [3,20].…”
Section: 2nd Hidden Layer (mentioning, confidence: 99%)
“…On the one hand, in the over-parameterized regime with s ≥ n, it has been observed that these neural networks exhibit certain intriguing phenomena such as the ability to fit random labels [10] and double descent [11]. Theoretical results [12], [13], [14], [15] for random features can be leveraged to explain these phenomena and provide an analysis of two-layer overparameterized neural networks. On the other hand, the random features model is a powerful tool for scaling up traditional kernel methods [16], [17], neural tangent kernel [12], [18], [19], graph neural networks [20], [21], and attention in Transformers [22], [23].…”
Section: Introduction (mentioning, confidence: 99%)
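The random features model referenced in this excerpt can be illustrated with the classical random Fourier feature construction of Rahimi and Recht, which approximates a Gaussian (RBF) kernel by the inner product of randomized cosine features. This is a generic sketch of that construction, not the specific models of the works cited in the excerpt; the feature dimension, bandwidth, and data sizes are illustrative assumptions.

```python
# Minimal sketch of a random features model: random Fourier features
# (Rahimi & Recht) approximating the Gaussian kernel
#   k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)).
# Feature dimension s, bandwidth sigma, and data sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, s, sigma = 5, 4096, 1.0             # input dim, number of features, bandwidth

# phi(x) = sqrt(2/s) * cos(W x + b),  rows of W ~ N(0, I/sigma^2), b ~ U[0, 2*pi)
W = rng.standard_normal((s, d)) / sigma
b = rng.uniform(0.0, 2.0 * np.pi, size=s)

def phi(X):
    """Map rows of X to s-dimensional random Fourier features."""
    return np.sqrt(2.0 / s) * np.cos(X @ W.T + b)

X = rng.standard_normal((10, d))
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma**2))
K_approx = phi(X) @ phi(X).T           # inner products of random features
print("max |K_exact - K_approx| =", np.abs(K_exact - K_approx).max())
```

Fitting only a linear output layer on top of phi(X) corresponds to the two-layer, fixed-random-hidden-weights setting that the excerpt connects to over-parameterized networks.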