2021
DOI: 10.48550/arxiv.2103.12692
Preprint

Benign Overfitting of Constant-Stepsize SGD for Linear Regression

Difan Zou, Jingfeng Wu, Vladimir Braverman, et al.

Abstract: There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting: constant-stepsize SGD (with iterate averaging) for linear regression in the overparameterized regime. Our main res…
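The setting described in the abstract can be made concrete with a small simulation. Below is a minimal, hypothetical sketch (not the paper's experiment; the dimensions, stepsize, and noise level are illustrative assumptions) of constant-stepsize SGD with iterate averaging on an overparameterized linear regression problem:

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 1000                        # fewer samples than parameters (overparameterized)
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

gamma = 0.01                            # constant stepsize
w = np.zeros(d)
w_sum = np.zeros(d)
num_steps = 10 * n
for t in range(num_steps):
    i = rng.integers(n)                 # one random observation per step
    grad = (X[i] @ w - y[i]) * X[i]     # stochastic gradient of 0.5 * (x_i^T w - y_i)^2
    w = w - gamma * grad
    w_sum += w
w_avg = w_sum / num_steps               # the averaged iterate the analysis concerns

print("train MSE of averaged iterate:", np.mean((X @ w_avg - y) ** 2))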

Cited by 4 publications (15 citation statements) | References 14 publications
“…We refer to Section 4.2 for more distribution details. In light of these bounds, Ours outperforms Bartlett et al (2020) in all the cases, and outperforms Zou et al (2021) in Constant / Piecewise Constant cases if ε < 1/2 and q < min{2 − r, 3/2}.…”
Section: Examples
confidence: 92%
“…Benign Overfitting focuses on deriving non-asymptotic generalization guarantees for overparameterized linear models (Bartlett et al, 2020), which relies on a strict assumption of the feature covariance matrix. Some recent papers focus on deriving benign overfitting under different regimes, e.g., constant-stepsize SGD (Zou et al, 2021), ridge regression (Tsigler and Bartlett, 2020), Random Features (Li et al, 2020b), Gaussian Mixture models (Wang and Thrampoulidis, 2021). This paper relaxes the requirement on the feature covariance matrix by introducing time-variant bounds.…”
Section: Related Work
confidence: 99%
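For readers new to this line of work, the estimator such benign-overfitting analyses typically study is the minimum-l2-norm interpolator of the training data. A minimal, hypothetical sketch follows (the isotropic Gaussian features and dimensions are illustrative assumptions, not the covariance structure required by the cited bounds):

import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 2000                        # overparameterized: d >> n
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.5 * rng.normal(size=n)

# Minimum-l2-norm solution w_hat = X^+ y interpolates the noisy labels exactly.
w_hat = np.linalg.pinv(X) @ y
print("train MSE:", np.mean((X @ w_hat - y) ** 2))       # ~ 0, i.e. the noise is fit perfectly

# With isotropic features, the excess risk reduces to the parameter error ||w_hat - w_star||^2.
print("excess risk:", np.sum((w_hat - w_star) ** 2))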
“…Bias-variance decomposition is widely used in machine learning analysis, e.g., adversarial training [51], double descent [1], uncertainty [25]. This paper considers a slightly different bias-variance decomposition following the analysis of SGD [14,27,57], where high bias means that the model cannot fit the noise data perfectly and high variance means that the model.…”
Section: Related Work
confidence: 99%
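As a reading aid, the SGD-style decomposition referenced in that snippet can be illustrated numerically. The sketch below is a hypothetical construction (problem sizes, stepsize, and noise level are assumptions): the bias part is the error of averaged SGD run on noiseless labels from a zero initialization, while the variance part is the error when SGD is started at the true parameter and only the label noise perturbs it.

import numpy as np

rng = np.random.default_rng(2)
n, d, gamma, steps = 200, 500, 0.01, 2000      # illustrative sizes and constant stepsize
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
noise = 0.5 * rng.normal(size=n)

def averaged_sgd(y, w0):
    """Constant-stepsize SGD with iterate averaging on the squared loss."""
    w, w_sum = w0.copy(), np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        w = w - gamma * (X[i] @ w - y[i]) * X[i]
        w_sum += w
    return w_sum / steps

# Bias: noiseless labels, zero start -- only the initialization error matters.
w_bias = averaged_sgd(X @ w_star, np.zeros(d))
# Variance: noisy labels, started at the optimum -- only the noise matters.
w_var = averaged_sgd(X @ w_star + noise, w_star)

print("bias term     :", np.sum((w_bias - w_star) ** 2))
print("variance term :", np.sum((w_var - w_star) ** 2))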