2018
DOI: 10.48550/arxiv.1806.00468
Preprint

Implicit Bias of Gradient Descent on Linear Convolutional Networks

Abstract: We show that gradient descent on full-width linear convolutional networks of depth $L$ converges to a linear predictor related to the $\ell_{2/L}$ bridge penalty in the frequency domain. This is in contrast to fully connected linear networks, where, regardless of depth, gradient descent converges to the $\ell_2$ maximum margin solution.
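
As a way of unpacking the abstract, the contrast can be written as two constrained max-margin problems over linearly separable data $\{(x_n, y_n)\}$. This is only a sketch assembled from the abstract's wording: the loss, the meaning of "full width", and the exact sense of convergence for the convolutional case are as stated in the paper, and $\widehat{\beta}$ denotes the discrete Fourier transform of the end-to-end linear predictor $\beta$.

\[
\begin{aligned}
&\text{fully connected, any depth:} &&
\bar{\beta}_{\mathrm{fc}} \;\propto\; \arg\min_{\beta}\ \|\beta\|_2
\quad \text{s.t. } y_n\, \beta^{\top} x_n \ge 1 \ \ \forall n, \\[4pt]
&\text{full-width convolutional, depth } L: &&
\bar{\beta}_{\mathrm{conv}} \;\propto\; \arg\min_{\beta}\ \|\widehat{\beta}\|_{2/L}^{2/L}
\quad \text{s.t. } y_n\, \beta^{\top} x_n \ge 1 \ \ \forall n,
\end{aligned}
\]

where $\|\widehat{\beta}\|_{p}^{p} = \sum_{k} |\widehat{\beta}[k]|^{p}$ is the bridge penalty. The point of the contrast is that depth enters only through the parametrization: the fully connected penalty is the same $\ell_2$ norm at every depth, whereas for convolutional networks the exponent $2/L \le 1$ (for $L \ge 2$) shrinks as depth grows, so the penalty increasingly favors predictors supported on few frequencies.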

Cited by 18 publications (55 citation statements). References 9 publications.
“…Deep linear networks (FCNs and CNNs) similar to our CNN toy example have been studied in the literature [4,16,20,37]. These studies use different approaches and assumptions and do not discuss the target shift mechanism which applies also for non-linear CNNs.…”
Section: A Additional Related Work
confidence: 99%
“…The results are shown in Fig 1 where we compare the theoretical predictions given by the solutions of the self-consistent equation (16) to the empirical values of α obtained by training actual CNNs and averaging their outputs across the ensemble. As n grows, the two converge to the identity line (dashed black line).…”
Section: Numerical Verification
confidence: 99%
“…Efforts to explain the effectiveness of gradient descent in deep learning have uncovered an exciting possibility: it not only finds solutions with low error, but also biases the search for low complexity solutions which generalize well (Zhang et al., 2017; Bartlett et al., 2017; Soudry et al., 2017; Gunasekar et al., 2018).…”
Section: Introduction
confidence: 99%
“…In defiance of the classical bias-variance trade-off, the performance of these interpolating classifiers continuously improves as the number of parameters increases well beyond the number of training samples [3][4][5][6]. Despite recent progress in describing the implicit bias of stochastic gradient descent towards "good" minima [7][8][9][10][11][12], and the detailed analysis of solvable models of learning [13][14][15][16][17][18][19][20][21][22][23][24], the mechanisms underlying this "benign overfitting" [25] in DNNs remain partially unclear, especially since "bad" local minima exist in the optimisation landscape of DNNs [26].…”
Section: Introduction
confidence: 99%