2020
DOI: 10.48550/arxiv.2011.14522
Preprint

Feature Learning in Infinite-Width Neural Networks

Abstract: As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the standard parametriz…
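For orientation, here is a minimal sketch of the two parametrizations the abstract contrasts, written for a single layer with fan-in n acting on an input x; the symbols W, x, h are generic placeholders, and the paper's proposed feature-learning modification is not reproduced here.

```latex
% Standard parametrization: the width dependence sits in the initialization variance.
h = W x, \qquad W_{ij} \sim \mathcal{N}\!\left(0, \tfrac{1}{n}\right)

% NTK parametrization: the width dependence is an explicit factor in the forward pass.
h = \tfrac{1}{\sqrt{n}}\, W x, \qquad W_{ij} \sim \mathcal{N}(0, 1)
```

Both choices give preactivations of order one at initialization; they differ in how gradient updates scale with n, which is what decides whether features can move in the infinite-width limit.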

Cited by 31 publications (61 citation statements)
References 31 publications
“…and similarly for S_source. Therefore, the generating function for h, z factorizes into a product of N factors, which we shall denote Z_i[j,], allowing us to express the total average partition function in the form…”
Section: Self-averaging Random Network (mentioning)
confidence: 99%
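A hedged schematic of the factorization this excerpt describes (the source current j and the single-site functional Z_i are placeholders; the cited paper's exact index structure is not recoverable from this snippet): independence across the N sites lets the averaged generating function split into a product.

```latex
\overline{Z}[j] \;=\; \prod_{i=1}^{N} Z_i[j]
\qquad\Longrightarrow\qquad
\ln \overline{Z}[j] \;=\; \sum_{i=1}^{N} \ln Z_i[j]
```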
“…The cumulant term requires moving the summation through both the exponential and the log, i.e., ln⟨e^{Σ_i y_i}⟩ = ln ∏_i ⟨e^{y_i}⟩ = Σ_i ln⟨e^{y_i}⟩. Note that while we require N to be sufficiently large for the Gaussian distributions to be valid, the factorization itself holds even at finite N, since it relies only on each term in the summations over z_i², φ(h_i)², and ϕ(x_i)² being identical, which is true by virtue of the integrals over h_i, z_i in (2.29).…”
Section: Mean-field Theory Approximation (mentioning)
confidence: 99%
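A hedged worked version of the finite-N point made in this excerpt, assuming the per-site variables are independent and identically distributed (the notation ⟨·⟩ for the average and s for the per-site term are placeholders, not the cited paper's):

```latex
\Bigl\langle e^{\sum_{i=1}^{N} s(h_i, z_i)} \Bigr\rangle
\;=\; \prod_{i=1}^{N} \bigl\langle e^{s(h_i, z_i)} \bigr\rangle
\;=\; \bigl\langle e^{s(h, z)} \bigr\rangle^{N},
```

which holds for any finite N; only the subsequent Gaussian approximation of each factor needs N to be large.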
“…The theory of neural tangent kernel (NTK) has been deemed an important tool to understand deep neural networks [15][16][17][18][19][20][21]. In the large-width limit, a generic neural network becomes nearly Gaussian when averaging over the initial weights and biases, and the learning capabilities become predictable.…”
Section: Introduction (mentioning)
confidence: 99%
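To make "predictable in the large-width limit" concrete, below is a small, hedged sketch (not code from the cited papers; the function names are made up for illustration) that computes the empirical NTK of a finite-width MLP in the NTK parametrization using JAX. As the hidden width grows, this random kernel concentrates around its deterministic infinite-width limit.

```python
import jax
import jax.numpy as jnp

def init_params(key, widths):
    """NTK parametrization: weights are N(0, 1); the 1/sqrt(fan_in) factor lives in the forward pass."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (n_in, n_out)))
    return params

def mlp(params, x):
    """Scalar-output MLP with tanh hidden layers, NTK-parametrized."""
    h = x
    for W in params[:-1]:
        h = jnp.tanh(h @ W / jnp.sqrt(W.shape[0]))
    W = params[-1]
    return (h @ W / jnp.sqrt(W.shape[0])).squeeze(-1)

def empirical_ntk(params, x1, x2):
    """Theta(x1, x2)[a, b] = sum over all parameters of df(x1[a])/dtheta * df(x2[b])/dtheta."""
    j1 = jax.tree_util.tree_leaves(jax.jacobian(mlp)(params, x1))  # leaves: (batch1, *W.shape)
    j2 = jax.tree_util.tree_leaves(jax.jacobian(mlp)(params, x2))
    flat1 = [a.reshape(a.shape[0], -1) for a in j1]
    flat2 = [b.reshape(b.shape[0], -1) for b in j2]
    return sum(a @ b.T for a, b in zip(flat1, flat2))

key = jax.random.PRNGKey(0)
params = init_params(key, [3, 512, 512, 1])   # widen the hidden layers to see the kernel concentrate
x = jax.random.normal(jax.random.PRNGKey(1), (5, 3))
print(empirical_ntk(params, x, x).shape)      # (5, 5) Gram matrix
```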
“…This scaling allows for nonlinear feature learning, unlike the NTK scaling [8]. While there are other scalings that also admit a certain sense of feature learning [7,31], the standard parameterization used in practice is known, in the infinite-width limit, to degenerate into NTK-like behaviors, which are not expected of practical finite-but-large-width neural networks [13,31]. In other words, all infinite-width scalings that display feature learning are only proxies of practical networks.…”
Section: Introduction (mentioning)
confidence: 99%
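As a hedged rule of thumb behind the "degenerates into NTK-like behaviors" statement (standard heuristic scalings, not formulas quoted from these papers): under NTK-style scaling the change of a hidden feature after a gradient step vanishes with width, whereas a feature-learning parametrization such as the paper's μP keeps it of order one.

```latex
\Delta h_{\mathrm{NTK}}(x) = \Theta\!\left(n^{-1/2}\right) \xrightarrow{\;n \to \infty\;} 0,
\qquad
\Delta h_{\mu\mathrm{P}}(x) = \Theta(1).
```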