2018
DOI: 10.48550/arxiv.1803.08367
Preprint

Gradient Descent Quantizes ReLU Network Features

Abstract: Deep neural networks are often trained in the over-parametrized regime (i.e. with far more parameters than training examples), and understanding why the training converges to solutions that generalize remains an open problem [Zhang et al., 2017]. Several studies have highlighted the fact that the training procedure, i.e., mini-batch Stochastic Gradient Descent (SGD), leads to solutions that have specific properties in the loss landscape. However, even with plain Gradient Descent (GD) the solutions found in the ov…
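To make the abstract's claim concrete, here is a minimal numerical sketch (not the paper's own experiment; the data, hyperparameters, and the direction-counting heuristic are assumptions chosen for illustration): an over-parametrized two-layer ReLU network is trained with plain full-batch gradient descent from a small initialization, and the contributing hidden weight vectors are then grouped by cosine similarity to estimate how many distinct directions survive.

```python
# A minimal sketch, not the paper's experiment: the data, hyperparameters, and
# direction-counting heuristic below are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 8, 2, 200                        # far fewer samples than hidden units
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]        # arbitrary smooth target

scale = 1e-3                               # small initialization
W = scale * rng.normal(size=(m, d))        # input-to-hidden weights
a = scale * rng.normal(size=m)             # hidden-to-output weights

lr, steps = 0.02, 30_000
for _ in range(steps):
    pre = X @ W.T                          # (n, m) pre-activations
    h = np.maximum(pre, 0.0)               # ReLU
    err = h @ a - y                        # residual of the squared loss
    grad_a = h.T @ err / n
    grad_W = ((err[:, None] * a) * (pre > 0)).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W

# Keep units that actually contribute to the output, then greedily group their
# normalized weight vectors by cosine similarity (0.99 is an arbitrary cutoff).
norms = np.linalg.norm(W, axis=1)
contrib = np.abs(a) * norms
keep = contrib > 1e-2 * contrib.max()
U = W[keep] / norms[keep, None]
directions = []
for u in U:
    if not any(u @ v > 0.99 for v in directions):
        directions.append(u)
print(f"{len(U)} contributing units occupy ~{len(directions)} distinct directions")
```

With a sufficiently small initialization one would expect the printed count to be far below the 200 hidden units, which is the quantization effect the title refers to.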

Cited by 25 publications (41 citation statements) | References 2 publications
“…A trade-off between the scale of the initialization and the training regime is also provided in [WTS+19, SPD+20]. [MBG18] proves that gradient flow forces the weight vectors to concentrate on a small number of directions determined by the input data. Through the lens of spline theory, [PN20b] explains that a number of best practices used in deep learning, such as weight decay and path-norm, are connected to the ReLU activation and its smooth counterparts.…”
Section: Related Work
confidence: 96%
“…For example, the Frequency Principle (Xu et al., 2019, 2020) states that NNs often fit target functions from low to high frequencies during training. A series of works study the mechanism of condensation at the initial training stage, such as for ReLU networks (Maennel et al., 2018; Pellegrini and Biroli, 2020) and for networks with continuously differentiable activation functions (Xu et al., 2021). This work in some sense serves as our attempt to uncover the theoretical structure underlying the condensation phenomenon from the perspective of the loss function, by proving a general Embedding Principle.…”
Section: Related Work
confidence: 99%
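Since the frequency-principle observation quoted above is an empirical statement, a short sketch can illustrate it. The script below (assumed target, width, initialization, and learning rate; not the cited authors' setup) fits a 1D signal containing one low and one high frequency with a two-layer ReLU network and prints the two corresponding Fourier components of the residual during training; the low-frequency component typically shrinks first.

```python
# A hedged illustration of the frequency principle: the target, network width,
# initialization, and learning rate are assumptions, not the cited papers' setup.
import numpy as np

rng = np.random.default_rng(1)

n, m = 128, 200
x = np.linspace(-1.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.3 * np.sin(10 * np.pi * x)   # low + high frequency

W = 0.5 * rng.normal(size=m)     # hidden weights (scalar input)
b = 0.5 * rng.normal(size=m)     # hidden biases
a = np.zeros(m)                  # output weights

def residual_component(err, k):
    """Magnitude of the k-th discrete Fourier component of the residual."""
    return np.abs(np.fft.rfft(err)[k]) / n

lr, steps = 0.01, 20_000
for step in range(steps):
    pre = x[:, None] * W + b                 # (n, m) pre-activations
    mask = pre > 0
    h = np.maximum(pre, 0.0)
    err = h @ a - y
    grad_a = h.T @ err / n
    grad_W = a * ((err * x)[:, None] * mask).sum(axis=0) / n
    grad_b = a * (err[:, None] * mask).sum(axis=0) / n
    a -= lr * grad_a
    W -= lr * grad_W
    b -= lr * grad_b
    if step % 2000 == 0:
        # Bins 2 and 10 correspond to sin(2*pi*x) and sin(10*pi*x) on [-1, 1].
        low, high = residual_component(err, 2), residual_component(err, 10)
        print(f"step {step:5d}: low-freq residual {low:.3f}, high-freq residual {high:.3f}")
```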
“…The frequency principle (Xu et al., 2019, 2020; Rahaman et al., 2019) shows that NNs, over-parameterized or not, tend to fit the training data by a low-frequency function, which suggests that the function learned by an NN is often of much lower complexity than the NN's capacity. Specifically, with small initialization, e.g., in a condensed regime, the weights of an NN are empirically found to condense on isolated directions, resulting in an output function mimicking that of a narrower NN (Maennel et al., 2018). These observations raise the question of in which sense learning in a wide NN is not drastically different from learning in a narrower NN, despite the potentially huge difference in their numbers of parameters.…”
Section: Introduction
confidence: 99%
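One way to read the "mimicking that of a narrower NN" remark quoted above: once the hidden weight vectors of a wide two-layer ReLU net have condensed onto a few directions, merging near-parallel units (summing their rescaled output weights, which is exact by the positive homogeneity of ReLU) yields a much narrower net computing essentially the same function. The sketch below demonstrates this on synthetic condensed weights; the construction and the 0.99 cosine tolerance are assumptions for illustration, not taken from the cited works.

```python
# Illustrative sketch: a wide two-layer ReLU net whose weights have condensed
# onto a few directions behaves like a much narrower net. The synthetic
# "condensed" weights and the merge tolerance are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(2)

def relu_net(X, W, a):
    """Two-layer ReLU net: inputs X (n, d), hidden weights W (m, d), output weights a (m,)."""
    return np.maximum(X @ W.T, 0.0) @ a

def merge_parallel_units(W, a, tol=0.99):
    """Merge hidden units whose directions align (cosine > tol).

    By positive homogeneity, a_j * relu(w_j . x) = (a_j * ||w_j||) * relu(u_j . x)
    with u_j = w_j / ||w_j||, so units sharing a direction just add their coefficients.
    """
    norms = np.linalg.norm(W, axis=1)
    U = W / norms[:, None]
    groups = []                                   # list of [direction, summed coefficient]
    for u, r, a_j in zip(U, norms, a):
        for g in groups:
            if u @ g[0] > tol:
                g[1] += a_j * r
                break
        else:
            groups.append([u, a_j * r])
    W_small = np.stack([g[0] for g in groups])
    a_small = np.array([g[1] for g in groups])
    return W_small, a_small

# Synthetic condensed wide net: 300 units scattered around 3 base directions.
d, m, k = 5, 300, 3
base = rng.normal(size=(k, d))
base /= np.linalg.norm(base, axis=1, keepdims=True)
W_wide = base[rng.integers(k, size=m)] + 1e-3 * rng.normal(size=(m, d))
a_wide = rng.normal(size=m) / m

X = rng.normal(size=(50, d))
W_small, a_small = merge_parallel_units(W_wide, a_wide)
gap = np.max(np.abs(relu_net(X, W_wide, a_wide) - relu_net(X, W_small, a_small)))
print(f"merged {m} units into {len(a_small)}; max prediction gap = {gap:.2e}")
```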
“…Several theoretical works studying neural network training with small initialization can be connected to simplicity bias. Maennel et al. (2018) uncovered a weight quantization effect in training two-layer nets with small initialization: gradient flow biases the weight vectors to a certain number of directions determined by the input data (independent of neural network width). It is hence argued that gradient flow has a bias towards "simple" functions, but their proof is not entirely rigorous and no clear definition of simplicity is given.…”
Section: Related Work
confidence: 99%