2020
DOI: 10.1609/aaai.v34i04.5736

Generalization Error Bounds of Gradient Descent for Learning Over-Parameterized Deep ReLU Networks

Abstract: Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. Very recently, a line of work explains in theory that with over-parameterization and proper random initialization, gradient-based methods can find the global minima of the training loss for DNNs. However, existing generalization error bounds are unable to explain the good generalization…
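As a concrete illustration of the phenomenon the abstract refers to, the sketch below trains a very wide deep ReLU network with plain (full-batch) gradient descent on randomly labeled data; with enough width the training loss can be driven close to zero. This is only a minimal illustrative sketch, not the paper's construction or proof setting; the architecture, widths, learning rate, and sample size are assumptions chosen for the demo.

```python
# Minimal sketch (not the paper's exact setting): plain full-batch gradient
# descent on an over-parameterized deep ReLU network fitting random labels.
# Widths, learning rate, and sample size are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d, width = 64, 10, 2048                         # few samples, very wide layers
X = torch.randn(n, d)
y = torch.randint(0, 2, (n, 1)).float() * 2 - 1    # random +/-1 labels

model = nn.Sequential(
    nn.Linear(d, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, 1),
)

opt = torch.optim.SGD(model.parameters(), lr=0.05)  # full batch => plain GD
loss_fn = nn.MSELoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  training loss {loss.item():.4f}")
# With enough width and enough steps the training loss is driven near zero,
# i.e. the network fits even random labels.
```

Evaluating the same model on a held-out set with fresh random labels would show the gap between training and generalization performance that the paper's bounds aim to explain.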

Cited by 142 publications (219 citation statements)
References 6 publications

“…need not be convex (even when (·) is). It has been argued in several recent papers that in highly overparameterized neural networks, because W is very high dimensional, any random initialization w_0 is close to it, with high probability [20], [22]-[25] (see also the discussion in Appendix A in the Supplementary Material). In such settings, it is reasonable to make the following assumption about the manifold.…”
Section: B. Main Results (mentioning, confidence: 99%)
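The geometric claim in this excerpt, that a random initialization in a very high-dimensional parameter space lies close to the set of solutions with high probability, can be illustrated with a much simpler linear analogy. This is a hedged sketch of the intuition only, not the argument of [20], [22]-[25]: the distance from a random point w_0 in R^p to the affine solution set {w : Aw = b} of an underdetermined system with n constraints shrinks relative to ||w_0|| as p grows with n fixed. All sizes below are illustrative.

```python
# Hedged sketch of the geometric intuition only (a linear analogy, not the
# neural-network argument of the cited papers). For A w = b with n constraints
# in R^p, the solution set is an affine subspace of dimension p - n, and the
# distance from a random w0 to it is
#   dist = || A^+ (A w0 - b) ||,   where A^+ is the pseudo-inverse.
# Relative to ||w0||, this distance shrinks as p grows with n fixed.
import numpy as np

rng = np.random.default_rng(0)
n = 50                                 # number of constraints ("training points")
for p in (100, 1_000, 10_000, 100_000):
    A = rng.standard_normal((n, p)) / np.sqrt(p)
    b = rng.standard_normal(n)
    w0 = rng.standard_normal(p)        # random initialization
    residual = A @ w0 - b
    dist = np.linalg.norm(np.linalg.pinv(A) @ residual)
    print(f"p = {p:7d}   dist / ||w0|| = {dist / np.linalg.norm(w0):.4f}")
```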
“…Only the threshold of the output node is fuzzified according to Theorems 1-3. The values of the other parameters are derived by training the FDNN as a crisp deep neural network using the commonly used GD or LM algorithm [3,20].…”
Section: 2nd Hidden Layer (mentioning, confidence: 99%)
“…On the one hand, in the over-parameterized regime with s ≥ n, it has been observed that these neural networks exhibit certain intriguing phenomena such as the ability to fit random labels [10] and double descent [11]. Theoretical results [12], [13], [14], [15] for random features can be leveraged to explain these phenomena and provide an analysis of two-layer overparameterized neural networks. On the other hand, the random features model is a powerful tool for scaling up traditional kernel methods [16], [17], neural tangent kernel [12], [18], [19], graph neural networks [20], [21], and attention in Transformers [22], [23].…”
Section: Introduction (mentioning, confidence: 99%)
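The random features model referenced in this excerpt can be illustrated with the classical random Fourier feature construction of Rahimi and Recht, which approximates a Gaussian (RBF) kernel by the inner product of randomized cosine features. This is a generic sketch of that construction, not the specific models of the works cited in the excerpt; the feature dimension, bandwidth, and data sizes are illustrative assumptions.

```python
# Minimal sketch of a random features model: random Fourier features
# (Rahimi & Recht) approximating the Gaussian kernel
#   k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)).
# Feature dimension s, bandwidth sigma, and data sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, s, sigma = 5, 4096, 1.0             # input dim, number of features, bandwidth

# phi(x) = sqrt(2/s) * cos(W x + b),  rows of W ~ N(0, I/sigma^2), b ~ U[0, 2*pi)
W = rng.standard_normal((s, d)) / sigma
b = rng.uniform(0.0, 2.0 * np.pi, size=s)

def phi(X):
    """Map rows of X to s-dimensional random Fourier features."""
    return np.sqrt(2.0 / s) * np.cos(X @ W.T + b)

X = rng.standard_normal((10, d))
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma**2))
K_approx = phi(X) @ phi(X).T           # inner products of random features
print("max |K_exact - K_approx| =", np.abs(K_exact - K_approx).max())
```

Fitting only a linear output layer on top of phi(X) corresponds to the two-layer, fixed-random-hidden-weights setting that the excerpt connects to over-parameterized networks.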