Linearized two-layers neural networks in high dimension

Ghorbani, Behrooz; Mei, Song; Misiakiewicz, Theodor; Montanari, Andrea

doi:10.48550/arxiv.1904.12191

Cited by 57 publications

(106 citation statements)

References 12 publications

Mentioning: 100

“…On the one hand, Liang et al [27] prove asymptotic consistency for ground truth functions with asymptotically bounded Hilbert norms for neural tangent kernels (NTK) and inner product (IP) kernels. In contrast, Ghorbani et al [17,18] show that for uniform distributions on the product of two spheres, consistency cannot be achieved unless the ground truth is a low-degree polynomial. This polynomial approximation barrier can also be observed for random feature and neural tangent regression [17,31,32].…”

Section: Introduction (mentioning)

confidence: 97%

“…In contrast, Ghorbani et al [17,18] show that for uniform distributions on the product of two spheres, consistency cannot be achieved unless the ground truth is a low-degree polynomial. This polynomial approximation barrier can also be observed for random feature and neural tangent regression [17,31,32].…”

Section: Introduction (mentioning)

confidence: 97%

“…While recent work [2,12,19] establishes explicit asymptotic upper bounds for the bias and variance for high-dimensional linear regression, the results for kernel regression are less conclusive in the regime d/n^β → c with β ∈ (0, 1). In particular, even though several papers [17,18,27] show that the variance decreases with the dimensionality of the data, the bounds on the bias are somewhat inconclusive. On the one hand, Liang et al [27] prove asymptotic consistency for ground truth functions with asymptotically bounded Hilbert norms for neural tangent kernels (NTK) and inner product (IP) kernels.…”

Section: Introduction (mentioning)

confidence: 99%

“…Notably, the two seemingly contradictory consistency results hold for different distributional settings and are based on vastly different proof techniques. While [27] proves consistency for general input distributions including isotropic Gaussians, the lower bounds in the papers [17,18] are limited to data that is uniformly sampled from the product of two spheres. Hence, it is a natural question to ask whether the polynomial approximation barrier is a more general phenomenon or restricted to the explicit settings studied in [17,18].…”

Section: Introduction (mentioning)

confidence: 99%

How rotational invariance of common kernels prevents generalization in high dimensions

Donhauser, Wu, Yang

2021

Preprint

Kernel ridge regression is well-known to achieve minimax optimal rates in low-dimensional settings. However, its behavior in high dimensions is much less understood. Recent work establishes consistency for kernel regression under certain assumptions on the ground truth function and the distribution of the input data. In this paper, we show that the rotational invariance property of commonly studied kernels (such as RBF, inner product kernels and fully-connected NTK of any depth) induces a bias towards low-degree polynomials in high dimensions. Our result implies a lower bound on the generalization error for a wide range of distributions and various choices of the scaling for kernels with different eigenvalue decays. This lower bound suggests that general consistency results for kernel ridge regression in high dimensions require a more refined analysis that depends on the structure of the kernel beyond its eigenvalue decay.
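The rotational invariance property named in this abstract can be checked numerically. The following is a minimal sketch, not the authors' experiment: it assumes an RBF kernel, inputs drawn uniformly from the unit sphere, and a small ridge penalty. The helper names `rbf_kernel`, `krr_fit_predict`, and `sample_sphere` are illustrative and do not come from the paper.

```python
import numpy as np


def rbf_kernel(A, B, bandwidth=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    # Squared Euclidean distances via the expansion |a|^2 + |b|^2 - 2 a.b
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))


def krr_fit_predict(X_train, y_train, X_test, ridge=1e-3, bandwidth=1.0):
    """Kernel ridge regression: solve (K + ridge * I) alpha = y, then predict."""
    K = rbf_kernel(X_train, X_train, bandwidth)
    alpha = np.linalg.solve(K + ridge * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train, bandwidth) @ alpha


def sample_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere in R^d."""
    Z = rng.standard_normal((n, d))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 50, 10
    X, X_test = sample_sphere(n, d, rng), sample_sphere(5, d, rng)
    y = X[:, 0]  # a degree-1 polynomial target, chosen only for illustration

    # Random orthogonal matrix from a QR decomposition.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

    # The kernel depends on the inputs only through pairwise distances,
    # so rotating every input leaves the kernel matrix unchanged.
    same_kernel = np.allclose(rbf_kernel(X, X), rbf_kernel(X @ Q.T, X @ Q.T))

    # Consequently the fitted predictor is unchanged when the training and
    # test inputs are rotated together.
    same_pred = np.allclose(
        krr_fit_predict(X, y, X_test),
        krr_fit_predict(X @ Q.T, y, X_test @ Q.T),
        atol=1e-6,
    )
    print(same_kernel, same_pred)
```

The sketch only demonstrates the invariance itself. It does not reproduce the lower bound on the generalization error, which concerns the high-dimensional regime analyzed in the paper.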

“…Due to the extreme non-linearity of the networks in both the generator and the discriminator, it is highly unlikely that the training objective of GANs can be convex-concave. In particular, even if the generator and the discriminator are linear functions over prescribed feature mappings, such as the neural tangent kernel (NTK) feature mappings [3,8,9,17,18,32,35,40,41,47,51,54,65,69,92,97], the training objective can still be non-convex-concave. Even worse, unlike supervised learning where some non-convex learning problems can be shown to have no bad local minima [44], to the best of our knowledge, it still remains unclear what the qualities are of those critical points in GANs except in the most simple setting when the generator is a one-layer neural network [42,62].…”

Section: Introduction (mentioning)

confidence: 99%

Forward Super-Resolution: How Can GANs Learn Hierarchical Generative Models for Real-World Distributions

Allen-Zhu

2021

Preprint

Generative adversarial networks (GANs) are among the most successful models for learning high-complexity, real-world distributions. However, in theory, due to the highly non-convex, non-concave landscape of the minmax training objective, GAN remains one of the least understood deep learning models. In this work, we formally study how GANs can efficiently learn certain hierarchically generated distributions that are close to the distribution of images in practice. We prove that when a distribution has a structure that we refer to as forward super-resolution, then simply training generative adversarial networks using gradient descent ascent (GDA) can indeed learn this distribution efficiently, both in terms of sample and time complexities. We also provide concrete empirical evidence that not only our assumption "forward super-resolution" is very natural in practice, but also the underlying learning mechanisms that we study in this paper (to allow us efficiently train GAN via GDA in theory) simulates the actual learning process of GANs in practice on real-world problems.

Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits

Dou, Liang

2020

Journal of the American Statistical Association

Consider the problem: given the data pair (x, y) drawn from a population with f*(x) = E[y | x = x], specify a neural network model and run gradient flow on the weights over time until reaching any stationarity. How does f_t, the function computed by the neural network at time t, relate to f*, in terms of approximation and representation? What are the provable benefits of the adaptive representation by neural networks compared to the pre-specified fixed basis representation in the classical nonparametric literature? We answer the above questions via a dynamic reproducing kernel Hilbert space (RKHS) approach indexed by the training process of neural networks. Firstly, we show that when reaching any local stationarity, gradient flow learns an adaptive RKHS representation and performs the global least-squares projection onto the adaptive RKHS, simultaneously. Secondly, we prove that as the RKHS is data-adaptive and task-specific, the residual for f* lies in a subspace that is potentially much smaller than the orthogonal complement of the RKHS. The result formalizes the representation and approximation benefits of neural networks. Lastly, we show that the neural network function computed by gradient flow converges to the kernel ridgeless regression with an adaptive kernel, in the limit of vanishing regularization. The adaptive kernel viewpoint provides new angles of studying the approximation, representation, generalization, and optimization advantages of neural networks.
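For readers unfamiliar with the term, "kernel ridgeless regression ... in the limit of vanishing regularization" can be written out for a generic fixed positive semidefinite kernel. This is a standard textbook form given as a sketch only. The symbols K, X, y, n, and λ are notation assumed here, not taken from the abstract, and the paper's adaptive kernel is not reproduced.

```latex
% Kernel ridge regression on training inputs X = (x_1, ..., x_n) with labels y:
\hat f_\lambda(x) = K(x, X)\,\bigl[K(X, X) + \lambda I_n\bigr]^{-1} y ,
\qquad \lambda > 0 .

% Ridgeless limit (vanishing regularization), with ^{+} the Moore-Penrose pseudoinverse:
\hat f_0(x) = \lim_{\lambda \to 0^{+}} \hat f_\lambda(x) = K(x, X)\, K(X, X)^{+}\, y .
```

When K(X, X) is invertible the pseudoinverse is the ordinary inverse, and the ridgeless predictor interpolates the training labels.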
