2019
DOI: 10.48550/arxiv.1906.03593
Preprint
Quadratic Suffices for Over-parametrization via Matrix Chernoff Bound

Abstract: We improve the over-parametrization size over two beautiful results [Li and Liang, 2018] and [Du, Zhai, Poczos and Singh, 2019] in deep learning theory.
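For context, the over-parametrization bounds in this line of work are stated in terms of the infinite-width NTK Gram matrix of a two-layer ReLU network and its least eigenvalue. The sketch below records the standard definitions used in [Du, Zhai, Poczos and Singh, 2019] and in the citing excerpts further down; the exact constants, logarithmic factors, and the precise dependence on the least eigenvalue are not reproduced here and should be read as assumptions about the general shape of the bound, not as the paper's stated theorem.

```latex
% Standard NTK Gram matrix for a two-layer ReLU network on unit-norm inputs x_1,...,x_n.
% (The poly(1/lambda) and log factors below are placeholders, not taken from the paper.)
\[
  H^{\infty}_{ij} \;=\; \mathbb{E}_{w \sim \mathcal{N}(0, I_d)}
  \Big[ x_i^{\top} x_j \,
        \mathbf{1}\{w^{\top} x_i \ge 0\}\,
        \mathbf{1}\{w^{\top} x_j \ge 0\} \Big],
  \qquad
  \lambda \;:=\; \lambda_{\min}(H^{\infty}) \;>\; 0 .
\]
% "Quadratic suffices": the required hidden width scales quadratically in the
% number of training samples n,
\[
  m \;=\; \widetilde{\Omega}\!\big( n^{2} \cdot \mathrm{poly}(1/\lambda) \big),
\]
% improving the higher polynomial dependence on n in the earlier analyses of
% [Li and Liang, 2018] and [Du, Zhai, Poczos and Singh, 2019].
```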

Cited by 36 publications (56 citation statements). References 17 publications.
“…We can relate the training and generalization behavior of dense and sparse models through their NTK. The standard result [Song and Yang, 2019] implies the following.…”
Section: F Neural Tangent Kernel Convergence and Generalization (mentioning)
confidence: 84%
“…We build on the great literature of NTK [Li and Liang, 2018, Du et al., 2019, Allen-Zhu et al., 2019b]. The standard result [Song and Yang, 2019] implies the following: if the NTK of the sparse model is close to the NTK of the dense model, then (i) their training convergence speed is similar, and (ii) their generalization bounds are similar. For completeness, we state the formal result in Appendix F.…”
Section: Convergence and Generalization of Sparse Network (mentioning)
confidence: 99%
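The closeness statement above compares the finite-width NTK Gram matrices of the two models. Below is a minimal NumPy sketch of that comparison, assuming the standard two-layer ReLU setting used throughout this line of work (only the first layer trained, second-layer signs in {±1} so they drop out of the kernel). The function name, the toy sparsity mask, and all problem sizes are illustrative assumptions, not quantities from the cited papers.

```python
import numpy as np

def empirical_ntk_gram(X, W):
    """Finite-width NTK Gram matrix of a two-layer ReLU net
    f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x) with a_r in {+1, -1},
    when only the first layer W is trained:

        H[i, j] = (1/m) * (x_i . x_j) * #{r : w_r . x_i >= 0 and w_r . x_j >= 0}

    X : (n, d) data matrix, W : (m, d) first-layer weights.
    """
    m = W.shape[0]
    pre = X @ W.T                      # (n, m) pre-activations w_r . x_i
    act = (pre >= 0).astype(X.dtype)   # (n, m) ReLU activation-pattern indicators
    # (x_i . x_j) times the fraction of neurons active on both x_i and x_j
    return (X @ X.T) * (act @ act.T) / m

# Hypothetical usage: compare the NTK of a dense net with a pruned ("sparse") copy.
rng = np.random.default_rng(0)
n, d, m = 50, 10, 4096
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs, as usual in these analyses
W_dense = rng.standard_normal((m, d))
mask = rng.random((m, d)) > 0.5                 # toy sparsity pattern (assumption)
W_sparse = W_dense * mask

H_dense = empirical_ntk_gram(X, W_dense)
H_sparse = empirical_ntk_gram(X, W_sparse)
print("||H_dense - H_sparse||_2 =", np.linalg.norm(H_dense - H_sparse, 2))
```

If this spectral-norm gap is small relative to the least eigenvalue of the dense kernel, the quoted argument transfers the dense model's convergence and generalization guarantees to the sparse one.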
“…There has been a deluge of works on the Neural Tangent Kernel since it was introduced by Jacot et al. (2018), and thus we do our best to provide a partial list. Global convergence guarantees for the optimization, and to a lesser extent generalization, for networks polynomially wide in the number of training samples n and other parameters have been addressed in several works (Du et al., 2019b; Oymak & Soltanolkotabi, 2020; Du et al., 2019a; Allen-Zhu et al., 2019a,b; Zou et al., 2020; Zou & Gu, 2019; Song & Yang, 2020; Arora et al., 2019). To our knowledge, for the regression problem with arbitrary labels, quadratic overparameterization m ≳ n^2 is state of the art (Oymak & Soltanolkotabi, 2020; Song & Yang, 2020; Nguyen & Mondelli, 2020).…”
Section: Related Work (mentioning)
confidence: 99%
“…Global convergence guarantees for the optimization, and to a lesser extent generalization, for networks polynomially wide in the number of training samples n and other parameters have been addressed in several works (Du et al., 2019b; Oymak & Soltanolkotabi, 2020; Du et al., 2019a; Allen-Zhu et al., 2019a,b; Zou et al., 2020; Zou & Gu, 2019; Song & Yang, 2020; Arora et al., 2019). To our knowledge, for the regression problem with arbitrary labels, quadratic overparameterization m ≳ n^2 is state of the art (Oymak & Soltanolkotabi, 2020; Song & Yang, 2020; Nguyen & Mondelli, 2020). E et al. (2020) gave a fairly comprehensive study of optimization and generalization of shallow networks trained under the standard parameterization.…”
Section: Related Work (mentioning)
confidence: 99%
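To give a rough sense of the scaling behind the "quadratic overparameterization m ≳ n^2" quoted above, the arithmetic below drops all constants and the poly(1/λ), log n factors; it is an illustration only, not a figure from the cited works.

```latex
% Hidden width required by a quadratic bound m >~ n^2
% (constants and poly(1/lambda), log n factors omitted):
%   n = 10^3 training samples  =>  m >~ 10^6 hidden units
%   n = 10^4 training samples  =>  m >~ 10^8 hidden units
\[
  m \;\gtrsim\; n^{2}:
  \qquad
  n = 10^{3} \Rightarrow m \gtrsim 10^{6},
  \qquad
  n = 10^{4} \Rightarrow m \gtrsim 10^{8}.
\]
```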