2021
DOI: 10.48550/arxiv.2103.09177
Preprint

Deep learning: a statistical viewpoint

Abstract: The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find i…

Cited by 16 publications (24 citation statements)
References 60 publications
“…In this paper, we survey the emerging field of TOPML research with a principal focus on foundational principles developed in the past few years. Compared to other recent surveys (Bartlett et al., 2021; Belkin, 2021), we take a more elementary signal processing perspective to elucidate these principles. Formally, we define the TOPML research area as the sub-field of ML theory where (1) there is clear consideration of exact or near interpolation of training data, and (2) the learned model complexity is high with respect to the training dataset size.…”
Section: Contents of This Paper (mentioning)
confidence: 99%
“…$\ge \lambda_p$. We refer the reader to Bartlett et al. (2021) for a detailed exposition of these results and their consequences. Here, we present the essence of these results in a popular model used in high-dimensional statistics, the spiked covariance model.…”
Section: When Does the Minimum $\ell_2$-Norm Solution Enable Harmless Interpolation… (mentioning)
confidence: 99%
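
To make the setting in this quote concrete, here is a minimal sketch (in Python, assuming NumPy) that draws data from a one-spike covariance model and fits the minimum $\ell_2$-norm interpolator via the pseudoinverse. The dimensions, spike strength, and noise level are illustrative assumptions, not values taken from Bartlett et al. (2021) or the citing survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): n samples in p >> n dimensions.
n, p = 50, 1000
spike, noise_std = 400.0, 0.5

# Spiked covariance Sigma = I_p + spike * u u^T: one large eigenvalue along direction u.
u = np.zeros(p)
u[0] = 1.0
X = rng.standard_normal((n, p)) + np.sqrt(spike) * rng.standard_normal((n, 1)) * u

# Signal lies in the spiked direction; labels are noisy linear responses.
theta_star = u
y = X @ theta_star + noise_std * rng.standard_normal(n)

# Minimum l2-norm interpolator: pinv(X) @ y fits the training data exactly when p > n.
theta_hat = np.linalg.pinv(X) @ y
Sigma = np.eye(p) + spike * np.outer(u, u)

print("training residual     :", np.linalg.norm(X @ theta_hat - y))  # ~0: exact interpolation
print("excess prediction risk:", (theta_hat - theta_star) @ Sigma @ (theta_hat - theta_star))
```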
“…These empirical mysteries inspire a recent flurry of activity towards understanding the generalization properties of various interpolators. A dominant fraction of recent efforts, however, concentrated on studying certain minimum $\ell_2$-norm interpolators, primarily in the context of linear and/or kernel regression (see, e.g., Mei and Montanari (2019); Hastie et al. (2019); Belkin et al. (2020); Bartlett et al. (2020, 2021), and the references therein). This was in part due to the existence of closed-form expressions for minimum $\ell_2$-norm interpolators, which are particularly handy when determining the statistical risk.…”
Section: Introduction (mentioning)
confidence: 99%
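
The closed form alluded to here is, when $p > n$ and $XX^\top$ is invertible, $\hat\theta = X^\top (XX^\top)^{-1} y$. Below is a minimal sketch under assumed isotropic Gaussian data (sizes and noise level are illustrative) showing that this expression interpolates the training set exactly and makes the test risk easy to evaluate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 400                               # assumed sizes: overparametrized linear regression
theta_star = rng.standard_normal(p) / np.sqrt(p)

X = rng.standard_normal((n, p))
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Closed form of the minimum l2-norm interpolator for p > n:
#   theta_hat = X^T (X X^T)^{-1} y
theta_hat = X.T @ np.linalg.solve(X @ X.T, y)
assert np.allclose(X @ theta_hat, y)          # the training data are fit exactly

# For isotropic test inputs x ~ N(0, I_p), the excess risk is simply
#   E[(x^T theta_hat - x^T theta_star)^2] = ||theta_hat - theta_star||^2.
print("excess test risk:", np.sum((theta_hat - theta_star) ** 2))
```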
“…Obstacles in the theoretical foundation include the higher-order nonlinear structures due to the stacking of multiple layers and the excessive number of network parameters in state-of-the-art networks. For some recent surveys, see [5,6].…”
Section: $R(A) - R_n(A)$ (mentioning)
confidence: 99%
“…For the filtration $\{\mathcal{F}_t\}_{t \ge 0}$ introduced before, batch gradient noise is defined as the $A$-dependent $\mathcal{F}_{t+1}$-measurable random vector $W_{t+1}(A) = \sqrt{m}\,\bigl(\nabla R_n(A) - \nabla R^{t}_n(A)\bigr)$. This random variable measures the effect of the subsampling on the gradient and allows to rewrite (5) as…”
Section: Relation Between… (mentioning)
confidence: 99%
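
As a concrete reading of the definition quoted above, the following sketch computes the batch gradient noise $W_{t+1}(A) = \sqrt{m}\,(\nabla R_n(A) - \nabla R^{t}_n(A))$ for a least-squares risk on a linear model, interpreting $R^{t}_n$ as the risk over the size-$m$ mini-batch drawn at step $t$. The loss, data, and batch size are illustrative assumptions rather than details from the citing paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 200, 10, 20                  # assumed sample size, dimension, mini-batch size

X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

def full_gradient(A):
    """Gradient of the empirical risk R_n(A) = ||X A - y||^2 / (2n)."""
    return X.T @ (X @ A - y) / n

def minibatch_gradient(A, idx):
    """Gradient of the risk over the size-m mini-batch drawn at step t."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ A - yb) / len(idx)

A = rng.standard_normal(p)             # current parameters
idx = rng.choice(n, size=m, replace=False)

# Batch gradient noise as in the quote: scaled gap between full and mini-batch gradients.
W = np.sqrt(m) * (full_gradient(A) - minibatch_gradient(A, idx))
print("||W_{t+1}(A)|| =", np.linalg.norm(W))
```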