Learning Two Layer Rectified Neural Networks in Polynomial Time

Bakshi, Ainesh; Jayaram, Rajesh; Woodruff, David P.

doi:10.48550/arxiv.1811.01885

Cited by 6 publications

(12 citation statements)

References 24 publications

(68 reference statements)


“…Learning Two-Layer Network [9,10,12,22,23,36,41,41,42,46,50,51,53,54,56,58,61]. There is a rich history of works considering the learnability of neural networks trained by SGD.…”

Section: More On Related Work

mentioning

confidence: 99%

“…Define functions S₀(x) = G₀(x), S₁(x) = G₁(x), as well as (it is convenient to think of those S(x) as the "features" used by learner network F(x))…”

Section: Learner Network

mentioning

confidence: 99%

“…is non-trivial due to the extreme non-convexity caused by the hierarchical structures in multi-layer networks. For such reason, it is not surprising that most existing theoretical works on the efficient learning regime of neural networks either focus on (1) two-layer networks [9,10,12,22,23,36,41,41,42,46,50,51,53,54,56,58,61] which do not have any hierarchical structure, or (2) a multi-layer network but essentially only the last layer is trained [14,33], or (3) reducing a multi-layer hierarchical neural network to non-hierarchical models such as kernel methods (a.k.a. the neural tangent kernel approach) [3, 5-8, 13, 13, 15, 17, 18, 25, 28, 35, 40, 43, 57, 62].…”

Section: Introduction

mentioning

confidence: 99%


Backward Feature Correction: How Deep Learning Performs Deep Learning

Allen-Zhu, Li

2020

Preprint

122


How does a 110-layer ResNet learn a high-complexity classifier using relatively few training examples and short training time? We present a theory towards explaining this in terms of hierarchical learning. By hierarchical learning we mean that the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions to reduce sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning efficiently and automatically simply by applying stochastic gradient descent (SGD) to the training objective.

On the conceptual side, we present, to the best of our knowledge, the first theoretical result indicating how very deep neural networks can still be sample and time efficient on certain hierarchical learning tasks, when no known non-hierarchical algorithms (such as kernel methods, linear regression over feature mappings, tensor decomposition, sparse coding, and their simple combinations) are efficient. We establish a new principle called "backward feature correction", which we believe is the key to understanding hierarchical learning in multi-layer neural networks.

On the technical side, we show for regression and even for binary classification, for every input dimension d > 0, there is a concept class consisting of degree ω(1) multi-variate polynomials so that, using ω(1)-layer neural networks as learners, SGD can learn any target function from this class in poly(d) time using poly(d) samples to any 1/poly(d) regression or classification error, through learning to represent it as a composition of ω(1) layers of quadratic functions. In contrast, we present lower bounds stating that several non-hierarchical learners, including any kernel methods and neural tangent kernels, must suffer from super-polynomial d^{ω(1)} sample or time complexity to learn functions in this concept class even to any d^{−0.01} error.



“…Most existing works analyzing the learnability of neural networks [9,12,13,21,22,29,34,35,43,44,48,50,51,57] make unrealistic assumptions about the data distribution (such as being random Gaussian), and/or make strong assumptions about the network (such as using linear activations). Li and Liang [33] show that two-layer ReLU networks can learn classification tasks when the data come from mixtures of arbitrary but well-separated distributions.…”

Section: What Can Neural Network Provably Learn?

mentioning

confidence: 99%

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

Allen-Zhu, Li, Liang

2018

Preprint

210


The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't a trained neural network overfit when it is overparameterized (namely, having more parameters than statistically needed to overfit training data)?

In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the overparameterized network.


“…It is necessary the negative result of kernel methods is distribution dependent, since for trivial distributions where x is non-zero only on the first constantly many coordinates, both neural networks and kernel methods can learn it with constantly many samples. If R(w) is the ℓ₂ regularizer, then this becomes a kernel method again since the minimizer can be written in the form (3.1).…”

mentioning

confidence: 99%

What Can ResNet Learn Efficiently, Going Beyond Kernels?

Allen-Zhu,

2019

Preprint


How can neural networks such as ResNet efficiently learn CIFAR-10 with test accuracy more than 96%, while other methods, especially kernel methods, fall relatively behind? Can we provide more theoretical justification for this gap?

Recently, there is an influential line of work relating neural networks to kernels in the overparameterized regime, proving they can learn certain concept classes that are also learnable by kernels with similar test error. Yet, can neural networks provably learn some concept class better than kernels?

We answer this positively in the PAC-learning language. We prove neural networks can efficiently learn a notable class of functions, including those defined by three-layer residual networks with smooth activations, without any distributional assumption. At the same time, we prove there are simple functions in this class such that with the same number of training examples, the test error obtained by neural networks can be much smaller than that of any kernel method, including neural tangent kernels (NTK).

The main intuition is that multi-layer neural networks can implicitly perform hierarchical learning using different layers, which reduces the sample complexity compared to "one-shot" learning algorithms such as kernel methods.

In the end, we also prove a computational complexity advantage of ResNet with respect to other learning methods including linear regression over arbitrary feature mappings.

