2019 IEEE International Symposium on Information Theory (ISIT)
DOI: 10.1109/isit.2019.8849614
Harmless interpolation of noisy data in regression


Cited by 58 publications (123 citation statements: 6 supporting, 117 mentioning, 0 contrasting)
References 15 publications
“…However, these calculations do not elucidate several crucial statistical phenomena, which are instead the main contribution of our work (see Section 2): optimality of large overparametrization, optimality of interpolators at high SNR (σ → 0 limit), the role of self-induced regularization, and the disappearance of the double descent at optimal overparametrization. Rate-optimal bounds on the generalization error of overparametrized linear models were recently derived in [12] (see also [51] for a different perspective).…”
Section: Learning Via Interpolation
mentioning
confidence: 99%
“…The initial version of this article [5] appeared concurrently with works of Hastie et al. [11], Muthukumar et al. [15], and Bartlett et al. [3], all of which also study the behavior of the least squares/least norm predictor in overparameterized linear regression. Muthukumar et al. [15] focus on the well-specified scenario (essentially p = D) and provide upper bounds on the risk that go to zero as p → ∞. (A related variance analysis was carried out by Neal et al. [16].)…”
mentioning
confidence: 99%
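To make the object discussed in the statement above concrete, here is a minimal numpy sketch of a least squares/least norm (minimum-ℓ2-norm) interpolating predictor of the kind these works analyze. The dimensions, noise level, and data model below are illustrative assumptions, not taken from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): n samples, p >> n features,
# i.e. the overparameterized regime discussed in the citing works.
n, p = 50, 500
sigma = 0.5  # assumed noise standard deviation

# Well-specified linear model: y = X @ beta + noise.
beta = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta + sigma * rng.normal(size=n)

# Least squares / least norm predictor: among all coefficient vectors that
# fit the training data exactly (X @ b = y), the pseudoinverse returns the
# one of minimum Euclidean norm.
beta_hat = np.linalg.pinv(X) @ y

print("training MSE:", np.mean((X @ beta_hat - y) ** 2))  # ~0: the noisy data are interpolated

# Excess risk on fresh, noiseless data can nonetheless remain small.
X_test = rng.normal(size=(10_000, p))
print("excess test risk:", np.mean((X_test @ (beta_hat - beta)) ** 2))
```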
“…1.2), our scheme applies eigen-weighting matrix Λ* to incentivize the optimizer to place higher weight on promising eigen-directions. This eigen-weighting procedure has been shown in the single-task case to be extremely crucial to avail the benefit of overparameterization [6,30,33]: it captures an inductive bias that promotes certain features and demotes others. We show that the importance of eigen-weighting extends to the multi-task case as well.…”
Section: Our Contributions
mentioning
confidence: 99%
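As a rough illustration of the eigen-weighting idea described in the statement above (not the cited work's actual construction of Λ*), the sketch below computes a minimum-norm interpolator in hypothetically reweighted coordinates so that promoted eigen-directions carry more of the fit; the data model and the diagonal weighting choice are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 500
sigma = 0.5  # assumed noise standard deviation

# Features with a few dominant eigen-directions (diagonal covariance, for simplicity).
eigvals = np.concatenate([np.full(10, 10.0), np.full(p - 10, 0.1)])
X = rng.normal(size=(n, p)) * np.sqrt(eigvals)

# Assume the signal lies along the dominant directions.
beta = np.zeros(p)
beta[:10] = 1.0
y = X @ beta + sigma * rng.normal(size=n)

# Hypothetical diagonal eigen-weighting w: solve the min-norm interpolation
# problem in the reweighted coordinates X @ diag(w), then map back.  Promoted
# directions pay less "norm cost", so the interpolator leans on them more.
w = np.sqrt(eigvals)
beta_weighted = w * (np.linalg.pinv(X * w) @ y)
beta_plain = np.linalg.pinv(X) @ y  # unweighted min-norm baseline

X_test = rng.normal(size=(10_000, p)) * np.sqrt(eigvals)
for name, b in [("plain min-norm", beta_plain), ("eigen-weighted", beta_weighted)]:
    print(name, "excess test risk:", np.mean((X_test @ (b - beta)) ** 2))
```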
“…Overparameterized ML and double-descent. The phenomenon of double-descent was first discovered by [6]. This paper and subsequent works on this topic [4,33,32,30,11] emphasize the importance of the right prior (sometimes referred to as inductive bias or regularization) to avail the benefits of overparameterization. However, an important question that arises is: where does this prior come from?…”
Section: Related Work
mentioning
confidence: 99%