2020
DOI: 10.1073/pnas.1907378117

Benign overfitting in linear regression

Abstract: The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of Gaussian linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization…
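The minimum norm interpolating rule mentioned in the abstract is the pseudoinverse solution, theta_hat = X^T (X X^T)^{-1} y, in the overparameterized regime. A minimal sketch of it follows (Python/NumPy); the isotropic Gaussian data, dimensions, and noise level are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 500                       # overparameterized: far more features than samples
theta_star = np.zeros(p)
theta_star[0] = 1.0                  # simple ground-truth signal (illustrative)

X = rng.standard_normal((n, p))      # isotropic Gaussian features (illustrative)
y = X @ theta_star + 0.5 * rng.standard_normal(n)   # noisy labels

# Minimum ell_2-norm interpolator: theta_hat = X^T (X X^T)^{-1} y
theta_hat = X.T @ np.linalg.solve(X @ X.T, y)

# Perfect fit to the noisy training data ...
print("max train residual:", np.max(np.abs(X @ theta_hat - y)))

# ... while the error on fresh data from the same model is what the paper studies
X_new = rng.standard_normal((10_000, p))
print("test MSE:", np.mean((X_new @ theta_hat - X_new @ theta_star) ** 2))
```

With this flat, isotropic spectrum the fit is generally not benign (much of the signal is lost and the test error stays near that of predicting zero); the paper's characterization singles out spectra whose behavior is more favorable.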

Citations: cited by 322 publications (433 citation statements)
References: 16 publications
“…Additionally, they study the misspecified model asymptotically and show that double descent can occur with sufficiently improved approximability, and replicate the phenomena in a specific non-linear model. [8] sharply upper and lower bound the (non-asymptotic) generalization error of the ℓ2-minimizing interpolator for Gaussian features (which are not whitened/independent in general). They characterize necessary and sufficient conditions for the ℓ2-minimizing interpolator to avoid what we call signal "bleed" and noise overfitting in terms of functionals of the spectrum of the Gaussian covariance matrix.…”
Section: Concurrent Work in High-Dimensional Linear Regression
confidence: 99%
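The "functionals of the spectrum" in the snippet above are, as best I recall the paper's definitions, the effective ranks r_k(Sigma) = (sum_{i>k} lambda_i) / lambda_{k+1} and R_k(Sigma) = (sum_{i>k} lambda_i)^2 / sum_{i>k} lambda_i^2. The sketch below just computes them for an illustrative decaying spectrum and should be read as an assumption-laden illustration, not the paper's own code:

```python
import numpy as np

def effective_ranks(eigs: np.ndarray, k: int):
    """Effective ranks r_k and R_k of a covariance spectrum.

    eigs must be sorted in decreasing order; the definitions are recalled
    from Bartlett et al. and should be checked against the paper.
    """
    tail = eigs[k:]                           # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / eigs[k]                # (sum of the tail) / lambda_{k+1}
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return r_k, R_k

# Illustrative spectrum: slow polynomial decay (not taken from the paper).
p = 2000
eigs = 1.0 / np.arange(1, p + 1) ** 1.01

for k in (0, 10, 100):
    r_k, R_k = effective_ranks(eigs, k)
    print(f"k = {k:4d}   r_k(Sigma) = {r_k:10.1f}   R_k(Sigma) = {R_k:10.1f}")
```

Roughly, the conditions ask the tail effective rank to be large relative to the sample size (so noise is spread over many weak directions) while the overall effective rank stays small relative to it (so the signal is not "bled" away); the exact statements should be taken from the paper, not from this sketch.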
“…An earlier edition of our work was presented at Information Theory and Applications, February 2019 and subsequently accepted to IEEE International Symposium on Information Theory, July 2019. Several elegant and interesting papers [6][7][8][9][10][46] have appeared around this time. All of these center around the analysis of the ℓ2-minimizing interpolator.…”
Section: Concurrent Work in High-Dimensional Linear Regression
confidence: 99%
“…Another reason why good solutions can be found so easily by stochastic gradient descent is that, unlike low-dimensional models where a unique solution is sought, different networks with good performance converge from random starting points in parameter space. Because of over-parameterization (13), the degeneracy of solutions changes the nature of the problem from finding a needle in a haystack to a haystack of needles.…”
Section: Origins Of Deep Learning. I Have Written a Book, The Deep Le…
confidence: 99%
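The "haystack of needles" picture in the snippet above can be reproduced even in an overparameterized linear model: gradient descent from different random starting points reaches essentially zero training error every time, yet the recovered parameter vectors differ. A minimal sketch, with all data, dimensions, and step sizes chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 200                       # many more parameters than training points
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)           # arbitrary (even pure-noise) targets

def gd_fit(theta0: np.ndarray, lr: float = 1e-2, steps: int = 20_000) -> np.ndarray:
    """Plain gradient descent on mean squared error from a given initialization."""
    theta = theta0.copy()
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n
        theta -= lr * grad
    return theta

solutions = [gd_fit(0.1 * rng.standard_normal(p)) for _ in range(3)]

for i, th in enumerate(solutions):
    print(f"run {i}: train MSE = {np.mean((X @ th - y) ** 2):.2e}")

# Every run interpolates, but the solutions are different points of an
# affine solution set of dimension p - n (here 180).
print("||theta_0 - theta_1|| =", np.linalg.norm(solutions[0] - solutions[1]))
```

Which needle gradient descent lands on is determined by the starting point: in this linear case, the component of the initialization orthogonal to the data's row space is never updated, so different random starts end at different interpolating solutions.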