“…is non-trivial due to the extreme non-convexity caused by the hierarchical structures in multi-layer networks. For such reason, it is not surprising that most existing theoretical works on the efficient learning regime of neural networks either focus on (1) two-layer networks [9,10,12,22,23,36,41,41,42,46,50,51,53,54,56,58,61] which do not have any hierarchical structure, or (2) a multi-layer network but essentially only the last layer is trained [14,33], or (3) reducing a multi-layer hierarchical neural network to non-hierarchical models such as kernel methods (a.k.a. the neural tangent kernel approach) [3, 5-8, 13, 13, 15, 17, 18, 25, 28, 35, 40, 43, 57, 62].…”