We consider the problem of learning an unknown function f on the d-dimensional sphere with respect to the square loss, given i.i.d. samples {(y_i, x_i)}_{i≤n}, where x_i is a feature vector uniformly distributed on the sphere and y_i = f(x_i) + ε_i. We study two popular classes of models that can be regarded as linearizations of two-layer neural networks around a random initialization: the random features model of Rahimi-Recht (RF) and the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons N diverges, for a fixed dimension d. We consider two specific regimes: the approximation-limited regime, in which n = ∞ while d and N are large but finite; and the sample-size-limited regime, in which N = ∞ while d and n are large but finite. In the first regime, we prove that if d^{ℓ+δ} ≤ N ≤ d^{ℓ+1−δ} for small δ > 0, then RF effectively fits a degree-ℓ polynomial in the raw features, and NT fits a degree-(ℓ+1) polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples satisfies d^{ℓ+δ} ≤ n ≤ d^{ℓ+1−δ}, then kernel methods can fit at most a degree-ℓ polynomial in the raw features. This lower bound is achieved by kernel ridge regression, and optimal prediction error is achieved for vanishing ridge regularization.
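The random features model described above can be sketched in a few lines: draw random first-layer weights, keep them fixed, and fit only the second-layer coefficients by ridge regression. The sketch below is a minimal illustration, not the paper's setup; the dimensions, the ReLU activation, the target function, and the regularization value are all illustrative choices, far from the asymptotic regimes the abstract analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n = 20, 200, 1000  # illustrative sizes, not the paper's asymptotic regime

# Features uniform on the d-dimensional sphere
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Target: a degree-2 polynomial in the raw features, plus noise (arbitrary example)
y = X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(n)

# RF model: random (frozen) first-layer weights, ReLU features
W = rng.standard_normal((N, d)) / np.sqrt(d)
Phi = np.maximum(X @ W.T, 0.0)

# Fit second-layer coefficients by ridge regression
lam = 1e-3
a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)
train_mse = np.mean((y - Phi @ a) ** 2)
```

The NT model differs only in which linearization of the network is fitted: instead of the activations themselves, one regresses on the gradient of the network output with respect to the first-layer weights at initialization.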
For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NNs) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layer NNs are known to encode richer smoothness classes than RKHS methods, and we know of special examples for which SGD-trained NNs provably outperform RKHS methods. This is true even in the wide-network limit, for a different scaling of the initialization. How can we reconcile these claims? For which tasks do NNs outperform RKHS methods? If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present the spiked covariates model, which captures in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test this hypothesis numerically by showing that specific perturbations of the training distribution degrade the performance of RKHS methods much more significantly than that of NNs.
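The RKHS methods referred to above amount to kernel ridge regression with a rotationally invariant kernel, i.e., a kernel depending on the inputs only through their inner product. A minimal sketch follows; the degree-3 polynomial kernel, the sizes, the target, and the regularization are illustrative assumptions, not choices made in the work summarized here.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 15, 400

# Nearly isotropic covariates on the sphere
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Rotationally invariant kernel: a function of <x, x'> only (polynomial, as an example)
K = (1.0 + X @ X.T) ** 3

# Kernel ridge regression: solve (K + lam I) alpha = y
lam = 1e-2
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Prediction at a new point on the sphere
x_new = rng.standard_normal(d)
x_new /= np.linalg.norm(x_new)
pred = ((1.0 + X @ x_new) ** 3) @ alpha
```

Under the curse of dimensionality discussed in the abstract, such a kernel method needs a sample size growing polynomially in d to fit each additional polynomial degree of the target, whereas a trained NN can first learn the relevant low-dimensional directions.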
We consider a linear regression y = Xβ + u, where X ∈ R^{n×p}, p ≫ n, and β is s-sparse. Motivated by examples in financial and economic data, we consider the situation where X has highly correlated and clustered columns. To perform sparse recovery in this setting, we introduce the clustering removal algorithm (CRA), which seeks to decrease the correlation in X by removing the cluster structure without changing the parameter vector β. We show that, as long as certain assumptions about X hold, the decorrelated matrix satisfies the restricted isometry property (RIP) with high probability. We also provide examples of the empirical performance of CRA and compare it with other sparse recovery techniques.
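The problem setting can be simulated directly: a design matrix whose columns fall into highly correlated clusters, a sparse coefficient vector, and a standard sparse-recovery baseline. The sketch below does not implement CRA (the abstract does not specify it); it only generates the clustered design and runs a plain lasso via iterative soft-thresholding (ISTA) as a point of comparison. All sizes and the factor-model construction are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, s = 100, 300, 5

# Clustered, highly correlated columns: columns in a cluster share a common factor
n_clusters = 30
factors = rng.standard_normal((n, n_clusters))
cluster_of = np.repeat(np.arange(n_clusters), p // n_clusters)
X = 0.9 * factors[:, cluster_of] + 0.1 * rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0, keepdims=True)  # unit-norm columns

# s-sparse ground truth and noisy observations
beta = np.zeros(p)
beta[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
y = X @ beta + 0.01 * rng.standard_normal(n)

# Baseline sparse recovery: lasso via ISTA
lam = 0.05
L = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the least-squares gradient
b = np.zeros(p)
for _ in range(500):
    z = b - X.T @ (X @ b - y) / L        # gradient step
    b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
```

On such a design, the strong within-cluster correlation is exactly what violates incoherence/RIP-type conditions, which is the failure mode CRA is designed to remove before applying a standard recovery method.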