2021
DOI: 10.48550/arxiv.2106.15013
Preprint

Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction

Abstract: Recently there has been significant theoretical progress on understanding the convergence and generalization of gradient-based methods on nonconvex losses with overparameterized models. Nevertheless, many aspects of optimization and generalization and in particular the critical role of small random initialization are not fully understood. In this paper, we take a step towards demystifying this role by proving that small random initialization followed by a few iterations of gradient descent behaves akin to popu…
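For intuition, here is a minimal numerical sketch of the mechanism described in the abstract, under simplifying assumptions that are not the paper's exact setting: gradient descent on a fully observed symmetric factorization loss ||U U^T − M||_F^2 with an overparameterized factor U and a tiny random initialization, rather than the paper's general measurement operator. All dimensions, the step size, and the initialization scale are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, k = 50, 3, 10                 # ambient dim, true rank, overparameterized width (k > r)

Xstar = rng.standard_normal((n, r))
M = Xstar @ Xstar.T                 # ground-truth PSD rank-r matrix

alpha = 1e-6                        # small random initialization scale
U = alpha * rng.standard_normal((n, k))

eta, T = 1e-3, 50                   # step size and number of early iterations
for _ in range(T):
    U = U - eta * 4.0 * (U @ U.T - M) @ U      # gradient of ||U U^T - M||_F^2

# Compare the dominant directions of the early iterate with the top-r
# eigenvectors of M (what a spectral method would return).
V_top = np.linalg.eigh(M)[1][:, -r:]
U_top = np.linalg.svd(U, full_matrices=False)[0][:, :r]
cosines = np.clip(np.linalg.svd(V_top.T @ U_top, compute_uv=False), -1.0, 1.0)
print("principal angles (deg):", np.degrees(np.arccos(cosines)))
```

In the small-initialization regime U U^T ≈ 0, so each step is approximately U ← (I + 4ηM)U, i.e. a few rounds of power iteration on M; this is the sense in which the early phase mimics a spectral method, and the printed principal angles should come out small.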

Cited by 2 publications (8 citation statements)
References 33 publications
“…Similar phenomena have been empirically observed in many other nonconvex problems, where vanilla gradient descent, when coupled with small random initialization (SRI) and early stopping (ES), has good generalization performance even with overparametrization due to the algorithmic regularization effect of SRI and ES [Woodworth et al, 2020, Ghorbani et al, 2020, Prechelt, 1998, Wang et al, 2021, Li et al, 2018, Stöger and Soltanolkotabi, 2021]. This motivates us to study the following question: What is the general behavior of the gradient descent dynamic (GD-M) coupled with SRI and ES?…”
Section: Introduction (supporting)
confidence: 57%
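As a purely illustrative toy version of the SRI + ES phenomenon quoted above (not taken from the cited works): gradient descent on an overparameterized factorization of a noisy low-rank matrix, started from a small random initialization, first fits the rank-r signal and only much later starts fitting the noise, so stopping early yields a better estimate. The observation model, dimensions, noise level, and step size below are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, k = 50, 3, 10
Xstar = rng.standard_normal((n, r))
M = Xstar @ Xstar.T                                  # rank-r signal
N = rng.standard_normal((n, n))
M_obs = M + 0.5 * (N + N.T) / np.sqrt(2)             # noisy symmetric observation

U = 1e-6 * rng.standard_normal((n, k))               # small random initialization (SRI)
eta = 1e-3
for t in range(1, 1501):
    U = U - eta * 4.0 * (U @ U.T - M_obs) @ U        # GD on ||U U^T - M_obs||_F^2
    if t % 150 == 0:
        err = np.linalg.norm(U @ U.T - M, "fro") / np.linalg.norm(M, "fro")
        print(f"iter {t:4d}  relative error to the noiseless M: {err:.3f}")
```

The error against the noiseless M typically bottoms out within the first few hundred iterations and then creeps back up as the k − r redundant directions start fitting noise, which is the early-stopping effect the quote refers to.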
“…• Model-free setting. Most existing analyses consider the setting where X is (exactly or approximately) low-rank with a sufficiently large singular value gap δ [Li et al, 2018, Zhuo et al, 2021, Ye and Du, 2021, Fan et al, 2020, Stöger and Soltanolkotabi, 2021].…”
Section: Iteration Complexity and Stepsize (mentioning)
confidence: 99%
“…Based on our simulations, we observed that SubGM with small random initialization behaves almost the same as SubGM with spectral initialization. Therefore, we conjecture that small random initialization followed by a few iterations of SubGM is in fact equivalent to spectral initialization; a similar result has been recently proven by Stöger and Soltanolkotabi [31] for gradient descent on the ℓ2-loss. We consider a rigorous verification of this conjecture as an enticing challenge for future research.…”
Section: Discussion (supporting)
confidence: 77%
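A rough simulation in the spirit of the comparison described above (an illustrative sketch, not the authors' experiments): the subgradient method (SubGM) on the ℓ1 matrix-sensing loss f(U) = (1/m) Σ_i |⟨A_i, U U^T⟩ − y_i|, run once from a small random initialization and once from a spectral initialization built from the top eigenvectors of (1/m) Σ_i y_i A_i. The problem sizes, the Gaussian measurement model, and the geometrically decaying step size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, k, m = 20, 2, 5, 1500          # dim, true rank, overparameterized width, measurements

Xstar = rng.standard_normal((n, r))
M = Xstar @ Xstar.T
A = rng.standard_normal((m, n, n))
A = (A + A.transpose(0, 2, 1)) / 2                   # symmetric sensing matrices
y = np.einsum("mij,ij->m", A, M)                     # noiseless measurements

def rel_err(U):
    return np.linalg.norm(U @ U.T - M, "fro") / np.linalg.norm(M, "fro")

def subgm(U0, steps=2000, step0=0.1, decay=0.9975, tag=""):
    U, step = U0.copy(), step0
    for t in range(1, steps + 1):
        resid = np.einsum("mij,ij->m", A, U @ U.T) - y
        G = np.einsum("m,mij->ij", np.sign(resid), A) / m   # subgradient wrt U U^T
        U = U - step * 2.0 * G @ U                           # subgradient step in U
        step *= decay
        if t % 500 == 0:
            print(f"{tag}  iter {t:4d}  relative error: {rel_err(U):.4f}")
    return U

U_small = 1e-6 * rng.standard_normal((n, k))          # (a) small random initialization
S = np.einsum("m,mij->ij", y, A) / m                  # (b) spectral initialization
w, V = np.linalg.eigh(S)
U_spec = V[:, -k:] * np.sqrt(np.clip(w[-k:], 0.0, None))

subgm(U_small, tag="small random init")
subgm(U_spec,  tag="spectral init    ")
```

In this sketch the two runs typically end up with errors of the same order, with the small-initialization run spending its first few dozen iterations in an alignment phase that plays the role of the spectral step.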
“…Therefore, the desirable performance of SubGM with small initialization can be attributed to its implicit regularization property. In particular, we show that small initialization of SubGM is akin to implicitly regularizing the redundant rank of the over-parameterized model, thereby avoiding overfitting; a recent work [31] has shown a similar property for the gradient descent algorithm on the noiseless matrix recovery with ℓ2-loss.…”
Section: Power of Small Initialization (mentioning)
confidence: 70%
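To see the implicit rank regularization concretely, here is a small sketch (again with illustrative assumptions rather than the cited papers' setting): gradient descent on the overparameterized factorization loss ||U U^T − M||_F^2 for a noiseless rank-r target, started from a tiny random initialization. Only r of the k singular values of U grow to fit the signal; the redundant k − r singular values stay near the initialization scale, so the extra rank is effectively never used.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, k = 50, 3, 10
Xstar = rng.standard_normal((n, r))
M = Xstar @ Xstar.T

U = 1e-6 * rng.standard_normal((n, k))    # small random initialization
eta = 1e-3
for t in range(1, 401):
    U = U - eta * 4.0 * (U @ U.T - M) @ U
    if t % 100 == 0:
        s = np.linalg.svd(U, compute_uv=False)
        print(f"iter {t:3d}  top-{r} singular values: {np.round(s[:r], 3)}  "
              f"remaining: {np.round(s[r:], 8)}")
```

The printed trajectory typically shows the top-r singular values saturating at roughly the square roots of the eigenvalues of M while the remaining ones stay many orders of magnitude smaller, which is the rank-regularization effect described in the quote.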