2021
DOI: 10.1017/s0962492921000039
Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

Abstract: In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep learning. The two key themes will be interpolation and its sibling over-parametrization. Interpolation corresponds …

Cited by 72 publications (48 citation statements)
References 35 publications
“…Despite being highly complex with the ability to even fit random labels and often trained to interpolate the training data, they achieve state-of-the-art out-of-sample generalization performance across a broad range of domains (Zhang et al, 2021). A partial explanation has been provided by the double-descent phenomenon (Belkin et al, 2019a; Belkin, 2021). Extending the generalization curve beyond the interpolation threshold reveals two regimes: the classical U-curve in the underparameterized regime and a monotonically decreasing curve in the overparameterized regime.…”
Section: Motivation and Related Work
Mentioning confidence: 99%
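The two regimes described in this excerpt can be reproduced with a toy model. The sketch below is only illustrative and is not an experiment from the cited works: it fits ridgeless (minimum-norm) least squares on random ReLU features of a synthetic linear-plus-noise dataset and sweeps the number of features p past the interpolation threshold p ≈ n; the test error typically rises towards the threshold and then descends again in the overparameterized regime.

```python
# Illustrative double-descent sketch (toy data and model chosen by me, not from the cited papers).
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20

# Ground-truth linear signal plus label noise.
w_star = rng.normal(size=d) / np.sqrt(d)
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ w_star + 0.1 * rng.normal(size=n_train)
y_test = X_test @ w_star

def random_relu_features(X, W):
    """Map inputs through fixed random weights W followed by a ReLU."""
    return np.maximum(X @ W, 0.0)

# Sweep the number of random features p across the interpolation threshold p ~ n_train.
for p in [10, 50, 90, 100, 110, 200, 500, 2000]:
    W = rng.normal(size=(d, p)) / np.sqrt(d)          # fixed random first layer
    Phi_train = random_relu_features(X_train, W)
    Phi_test = random_relu_features(X_test, W)
    # Minimum-norm least squares: pinv handles both under- and over-parameterized cases.
    beta = np.linalg.pinv(Phi_train) @ y_train
    train_mse = np.mean((Phi_train @ beta - y_train) ** 2)
    test_mse = np.mean((Phi_test @ beta - y_test) ** 2)
    print(f"p={p:5d}  train MSE={train_mse:8.4f}  test MSE={test_mse:8.4f}")
```

In a typical run the train MSE reaches zero once p exceeds n_train, while the test MSE peaks near that threshold and then decreases again as p grows, matching the U-curve-plus-descent picture quoted above.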
“…Several recent works have investigated the nature of modern Deep Neural Networks (DNNs) past the point of zero training error (Belkin, 2021; Nakkiran et al, 2020; Bartlett et al, 2021; Power et al, 2022). The stage at which the training error reaches zero is called the Interpolation Threshold (IT), since at this point, the learned network function interpolates between training samples.…”
Section: Introduction
Mentioning confidence: 99%
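For concreteness, the interpolation threshold mentioned in this excerpt can be stated as follows; the notation is mine, not taken from the cited papers.

```latex
% A predictor f_\theta interpolates the training set S = \{(x_i, y_i)\}_{i=1}^n when
\[
  \widehat{R}_n(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f_\theta(x_i),\, y_i\bigr) \;=\; 0
  \quad\Longleftrightarrow\quad
  f_\theta(x_i) = y_i \quad \text{for all } i = 1,\dots,n,
\]
% assuming a loss with \ell(y, y') = 0 iff y = y'. The interpolation threshold is then the
% smallest model capacity (e.g. parameter count) at which such a zero-training-error
% solution is first attained.
```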
“…This property is nowadays called the benign overfitting (BO) phenomenon [4, 2] and has been the subject of many recent works in the statistical community. The motivation is to identify situations where benign overfitting holds, that is, when an estimator with a perfect fit on the training data can still generalize well.…”
Section: Introduction
Mentioning confidence: 99%
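A concrete toy instance of benign overfitting is sketched below; the dimensions, noise level, and covariance structure are my own illustrative choices, not taken from the cited works. The minimum-norm linear interpolator fits the noisy labels exactly, yet typically predicts far better than the trivial zero predictor, because the signal sits in a few high-variance directions while the label noise is absorbed by the many low-variance ones.

```python
# Illustrative benign-overfitting sketch: min-norm interpolation of noisy labels in high dimension.
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2000                        # far more features than samples

# Anisotropic features: a handful of strong directions, many weak "junk" directions.
scales = np.concatenate([np.full(5, 1.0), np.full(d - 5, 0.02)])
w_star = np.zeros(d)
w_star[:5] = 1.0                       # signal lives only in the strong directions

X_train = rng.normal(size=(n, d)) * scales
X_test = rng.normal(size=(5000, d)) * scales
y_train = X_train @ w_star + 0.5 * rng.normal(size=n)   # noisy labels
y_test = X_test @ w_star                                  # clean targets

# Minimum-norm interpolator: fits the noisy training labels exactly.
w_hat = np.linalg.pinv(X_train) @ y_train

print("train MSE:", np.mean((X_train @ w_hat - y_train) ** 2))   # ~0: perfect fit
print("test  MSE:", np.mean((X_test @ w_hat - y_test) ** 2))     # typically small
print("null  MSE:", np.mean(y_test ** 2))                         # baseline: predict zero
```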
“…We consider this model as a benchmark because, while it likely does not reflect real-world data, it is expected to be universal in the sense that results obtained in other, more realistic statistical models can be compared with, or tend to, those obtained in this ideal benchmark Gaussian model. The relevance of approximating large neural networks by linear models via the neural tangent kernel feature map [19, 18] in certain regimes has been discussed extensively in the machine learning community, for instance in [4, 28, 1] and references therein.…”
Section: Introduction
Mentioning confidence: 99%
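The linear-model approximation via the neural tangent kernel referred to in this excerpt can be sketched as follows; the two-layer tanh architecture, the 1/sqrt(m) output scaling, and the toy regression target are my own assumptions for illustration. The network is expanded to first order around its initialization, and the parameter correction is fit by least squares in the resulting tangent features.

```python
# Hedged sketch of the linearized-network (neural tangent kernel) approximation:
#   f_lin(x; theta) = f(x; theta0) + grad_theta f(x; theta0) . (theta - theta0).
import numpy as np

rng = np.random.default_rng(2)
d, m, n = 5, 200, 40                     # input dim, hidden width, training points

W0 = rng.normal(size=(m, d))             # initialization theta0 = (W0, a0)
a0 = rng.normal(size=m)

def forward(X, W, a):
    """Two-layer tanh network with 1/sqrt(m) output scaling."""
    return np.tanh(X @ W.T) @ a / np.sqrt(m)

def tangent_features(X, W, a):
    """Gradient of f(x; theta) with respect to all parameters, flattened per example."""
    H = np.tanh(X @ W.T)                 # (n, m) hidden activations
    dH = 1.0 - H ** 2                    # derivative of tanh
    grad_a = H / np.sqrt(m)              # (n, m)
    # grad_W[i, j, :] = a[j] * tanh'(w_j . x_i) * x_i / sqrt(m)
    grad_W = (a * dH)[:, :, None] * X[:, None, :] / np.sqrt(m)   # (n, m, d)
    return np.concatenate([grad_a, grad_W.reshape(len(X), -1)], axis=1)

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])                      # toy regression target

Phi = tangent_features(X, W0, a0)        # tangent (NTK) feature map at initialization
residual = y - forward(X, W0, a0)
delta = np.linalg.pinv(Phi) @ residual   # min-norm correction in parameter space

pred_lin = forward(X, W0, a0) + Phi @ delta
print("linearized-model train MSE:", np.mean((pred_lin - y) ** 2))   # ~0: interpolates
# The NTK Gram matrix used in kernel-style analyses is simply Phi @ Phi.T.
```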