2020
DOI: 10.48550/arxiv.2011.13772
Preprint

Gradient Descent for Deep Matrix Factorization: Dynamics and Implicit Bias towards Low Rank

Abstract: We provide an explicit analysis of the dynamics of vanilla gradient descent for deep matrix factorization in a setting where the minimizer of the loss function is unique. We show that the recovery rate of ground-truth eigenvectors is proportional to the magnitude of the corresponding eigenvalues and that the differences among the rates are amplified as the depth of the factorization increases. For exactly characterized time intervals, the effective rank of gradient descent iterates is provably close to the eff…
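The behaviour described in the abstract can be simulated in a few lines. The script below is a minimal sketch, not the paper's code: it runs vanilla gradient descent on a depth-N factorization W_N⋯W_1 of a fixed matrix M with well-separated singular values and prints the loss, the leading singular values of the product, and an entropy-based effective rank as training proceeds. All hyperparameters (n, N, step, init_scale, iters) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the paper's code): vanilla gradient descent on a depth-N
# matrix factorization W_N ... W_1 fitted to a fixed matrix M, tracking how the
# singular values of the product and its effective rank evolve over training.
import numpy as np

rng = np.random.default_rng(0)
n, N, step, init_scale, iters = 10, 3, 0.05, 0.1, 10000

# Ground truth with well-separated singular values 1.0, 0.5, 0.1 (rank 3).
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
M = U[:, :3] @ np.diag([1.0, 0.5, 0.1]) @ V[:, :3].T

# N factors with a small random initialization.
Ws = [init_scale * rng.standard_normal((n, n)) for _ in range(N)]

def product(factors):
    P = np.eye(n)
    for W in factors:
        P = W @ P                      # builds W_N ... W_1 for factors = [W_1, ..., W_N]
    return P

def effective_rank(A):
    s = np.linalg.svd(A, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-16)).sum()))   # entropy-based effective rank

for t in range(iters):
    P = product(Ws)
    G = P - M                          # gradient of 0.5 * ||P - M||_F^2 with respect to the product
    # Gradient w.r.t. W_i is (W_N ... W_{i+1})^T G (W_{i-1} ... W_1)^T; compute all
    # gradients first, then update, so this is a plain (simultaneous) GD step.
    grads = [product(Ws[i + 1:]).T @ G @ product(Ws[:i]).T for i in range(N)]
    for W, g in zip(Ws, grads):
        W -= step * g
    if t % 2000 == 0:
        s = np.linalg.svd(P, compute_uv=False)[:3]
        print(f"iter {t:5d}  loss {0.5 * np.linalg.norm(G)**2:.5f}  "
              f"top singular values {np.round(s, 3)}  eff. rank {effective_rank(P):.2f}")
```

With singular values this well separated, the larger ones are typically fitted noticeably earlier than the smaller ones, which is the kind of depth-amplified gap in recovery speed, together with a low effective rank of the iterates, that the abstract refers to.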

Cited by 4 publications (9 citation statements) | References 18 publications (55 reference statements)
“…For t = t⋆, we first note that inequalities (54), (55), and (56) follow directly from our assumptions. In order to prove inequality (53) we note that…”
Section: Analysis of the Spectral Phase
confidence: 93%
“…Linear neural networks: In [51,52,53,54,55] the convergence of gradient flow and gradient descent is studied for (deep) linear neural networks of the form min…”
Section: Related Work
confidence: 99%
“…Gradient descent training takes linear fully-connected networks to max-margin solutions (Soudry et al., 2018), while it takes linear convolutional networks to solutions with a different implicit penalty in the frequency domain (Gunasekar et al., 2018a). Deep matrix factorization by deep linear networks with gradient descent induces nuclear norm minimization of the learned matrix, leading to an implicit low-rank regularization (Gunasekar et al., 2018b; Arora et al., 2019; Chou et al., 2020).…”
Section: Related Work
confidence: 99%
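As a concrete, hypothetical illustration of the implicit low-rank regularization mentioned in this excerpt (not the code of any of the cited papers): gradient descent on a factorized matrix-completion loss X = W2·W1, started from a small initialization, tends to settle on a solution of small nuclear norm among the many matrices that fit the observed entries. Matrix size, rank, sampling rate, step size and iteration count below are assumptions.

```python
# Hypothetical matrix-completion sketch: GD on a depth-2 factorization from a
# small initialization, trained only on observed entries of a rank-2 matrix.
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 2
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
M = A / np.linalg.norm(A, 2)                 # rank-2 target, spectral norm 1
mask = rng.random((n, n)) < 0.5              # observe roughly half of the entries

W1 = 1e-3 * rng.standard_normal((n, n))      # X = W2 @ W1, small initialization
W2 = 1e-3 * rng.standard_normal((n, n))
step = 0.1

for _ in range(20000):
    X = W2 @ W1
    G = mask * (X - M)                       # the loss only sees observed entries
    # Simultaneous GD step on 0.5 * ||mask * (W2 W1 - M)||_F^2.
    W1, W2 = W1 - step * W2.T @ G, W2 - step * G @ W1.T

X = W2 @ W1

def nuclear_norm(B):
    return np.linalg.svd(B, compute_uv=False).sum()

print("residual on observed entries :", np.linalg.norm(mask * (X - M)))
print("error on unobserved entries  :", np.linalg.norm(~mask * (X - M)))
print("nuclear norm, learned vs. ground truth:", nuclear_norm(X), nuclear_norm(M))
```

The loss constrains only the observed entries, so many zero-loss solutions exist; the point of the sketch is that the one gradient descent finds from small initialization is typically close in nuclear norm to the low-rank ground truth.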
“…Since we observe convergence in plot 1c, this suggests that the bound of Theorem 2.4 may not be entirely sharp. But increasing the step size beyond a certain value leads to divergence, as suggested by plot 2d, so that some bound on the step size is necessary (see also [7, Lemma A.1] for a necessary condition in a special case). In our second set of experiments we use a sequence of step sizes η_k that converges to zero at various speeds.…”
Section: Numerical Experiments
confidence: 99%
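The step-size threshold mentioned in this excerpt can be illustrated with a small sweep. The sketch below uses an assumed depth-2 factorization and assumed step sizes rather than the quoted paper's experimental setup: small enough step sizes drive the loss down, while sufficiently large ones make the iterates blow up.

```python
# Step-size sweep on a plain depth-2 factorization loss 0.5*||W2 W1 - M||_F^2.
import numpy as np

rng = np.random.default_rng(2)
n = 10
A = rng.standard_normal((n, 3)) @ rng.standard_normal((3, n))
M = A / np.linalg.norm(A, 2)                 # rank-3 target, spectral norm 1

for step in [0.01, 0.05, 0.2, 1.0, 5.0]:
    W1 = 0.1 * rng.standard_normal((n, n))
    W2 = 0.1 * rng.standard_normal((n, n))
    loss = np.inf
    with np.errstate(all="ignore"):          # silence overflow warnings once iterates diverge
        for _ in range(5000):
            G = W2 @ W1 - M                  # gradient w.r.t. the product
            loss = 0.5 * np.linalg.norm(G) ** 2
            if not np.isfinite(loss):        # the iterates blew up for this step size
                break
            W1, W2 = W1 - step * W2.T @ G, W2 - step * G @ W1.T
    print(f"step size {step:4.2f} -> final loss {loss:.3e}")
```

The exact threshold depends on the problem (target norm, depth, initialization), which is why the quoted statement only says that some bound on the step size is necessary.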
“…As a result, one possible explanation for the good generalization of overparameterized trained neural networks is that the implicit bias of (stochastic) gradient descent is towards solutions of low complexity in a suitable sense, resulting in good generalization. While a theoretical analysis of this phenomenon seems difficult for nonlinear networks, first works for linear networks indicate that gradient descent leads to linear networks (factorized matrices) of low rank [2,7,10,11,16], although many open questions remain. Another important role seems to be played by the random initialization, see e.g.…”
Section: Introduction
confidence: 99%