“…The sharp contrast between the so-called kernel and rich regimes (Woodworth et al., 2020) reflects the importance of the initialization scale: a large initialization often leads to the kernel regime, in which features barely change during training (Jacot et al., 2018; Chizat et al., 2018; Du et al., 2018, 2019; Allen-Zhu et al., 2019a,b; Zou et al., 2020; Arora et al., 2019b; Yang, 2019; Jacot et al., 2021), while with a small initialization the solution exhibits richer behavior and the resulting model has lower complexity (Gunasekar et al., 2018b,c; Li et al., 2018; Razin and Cohen, 2020; Arora et al., 2019a; Chizat and Bach, 2020; Li et al., 2020; Lyu and Li, 2019; Lyu et al., 2021; Razin et al., 2022; Stöger and Soltanolkotabi, 2021; Ge et al., 2021). Recently, Yang and Hu (2021) gave a complete characterization of the relationship between initialization scale, parametrization, and learning rate required to avoid the kernel regime.…”