“…The sharp contrast between the so-called kernel and rich regimes (Woodworth et al., 2020) reflects the importance of the initialization scale: a large initialization often leads to the kernel regime, in which features barely change during training (Jacot et al., 2018; Chizat et al., 2018; Du et al., 2018, 2019; Allen-Zhu et al., 2019a,b; Zou et al., 2020; Arora et al., 2019b; Yang, 2019; Jacot et al., 2021), while with a small initialization the solution exhibits richer behavior and the resulting model has lower complexity (Gunasekar et al., 2018b,c; Li et al., 2018; Razin and Cohen, 2020; Arora et al., 2019a; Chizat and Bach, 2020; Li et al., 2020; Lyu and Li, 2019; Lyu et al., 2021; Razin et al., 2022; Stöger and Soltanolkotabi, 2021; Ge et al., 2021). Recently, Yang and Hu (2021) gave a complete characterization of the relationship between initialization scale, parametrization, and learning rate required to avoid the kernel regime.…”