2023
DOI: 10.34133/research.0024
Dynamics in Deep Classifiers Trained with the Square Loss: Normalization, Low Rank, Neural Collapse, and Generalization Bounds

Abstract: We overview several properties—old and new—of training overparameterized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous rectified linear unit networks. We study the convergence to a solution with the absolute minimum ρ, which is the product of the Frobenius norms of each layer weight matrix, when normalization by Lagrange multipliers is used together with weight decay under different forms of gra…
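
The complexity measure ρ named in the abstract, the product of the per-layer Frobenius norms, is straightforward to compute directly. The sketch below is a minimal illustration for a toy homogeneous ReLU network in NumPy; the layer sizes and the helper names (`forward`, `rho`) are chosen here for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deep homogeneous ReLU network: weight matrices only, no biases.
layer_sizes = [10, 32, 32, 1]          # illustrative sizes, not from the paper
weights = [rng.standard_normal((m, n)) / np.sqrt(n)
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, weights):
    """Plain ReLU forward pass with a linear output layer."""
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)
    return weights[-1] @ x

def rho(weights):
    """rho = product of the per-layer Frobenius norms, the norm-based
    complexity measure discussed in the abstract."""
    return float(np.prod([np.linalg.norm(W, "fro") for W in weights]))

x = rng.standard_normal(layer_sizes[0])
print("output:", forward(x, weights))
print("rho   :", rho(weights))
```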

Cited by 6 publications (5 citation statements); references 19 publications.

Citation statements (ordered by relevance):
“…The SPSS software was used for linear regression analysis [36] of the first group of data (all data corresponding to the test gradient under the plasma background selected from the full set of gradient data tested in the PBS), second group of data (all data tested in the plasma), and third group of data (full gradient data tested in the PBS). The detailed calculation processes are presented in the Supplementary Materials.…”
Section: Methods (mentioning; confidence: 99%)
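
The quoted Methods passage says only that linear regression was run separately on three groups of gradient data in SPSS. As a rough illustration of that workflow, here is a hedged Python sketch using scipy.stats.linregress on synthetic data; the concentration values, group splits, and variable names are hypothetical stand-ins, since the citing paper's actual data and SPSS settings are not given here.

```python
# Illustrative equivalent of the grouped linear regression described in the
# quoted Methods section; the citing paper used SPSS, so the data layout and
# names below (concentration vs. signal, PBS/plasma groups) are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic gradient data: signal roughly linear in concentration, two backgrounds.
concentration = np.tile(np.array([0.1, 0.5, 1.0, 5.0, 10.0]), 3)
signal_pbs = 2.0 * concentration + rng.normal(0, 0.3, concentration.size)
signal_plasma = 1.8 * concentration + rng.normal(0, 0.5, concentration.size)

groups = {
    "group 1 (PBS points matching the plasma test gradient)": (concentration[:5], signal_pbs[:5]),
    "group 2 (all plasma data)": (concentration, signal_plasma),
    "group 3 (full PBS gradient data)": (concentration, signal_pbs),
}

for name, (x, y) in groups.items():
    fit = stats.linregress(x, y)   # ordinary least-squares line fit
    print(f"{name}: slope={fit.slope:.3f}, intercept={fit.intercept:.3f}, R^2={fit.rvalue**2:.3f}")
```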
“…The above methods have improved the generalization ability of the model to some extent, but they all rely heavily on the parameter settings of the higher-level neural network, which is a significant limitation. Implicit regularization methods reduce the generalization error of the model without limiting its representation ability; examples include batch normalization [19] and weight normalization [20]. They accelerate the convergence of the model and improve its generalization ability by smoothing the variance between data or by normalizing the parameter vectors to unit length, decoupling the network's weight matrices into magnitudes and directions.…”
Section: Related Work (mentioning; confidence: 99%)
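
Weight normalization, cited in this passage as [20], reparameterizes each weight vector as a magnitude times a unit direction. A minimal NumPy sketch of that decoupling for a single linear layer follows; the class and parameter names are illustrative, not from either paper.

```python
# Minimal sketch of weight normalization as described in the quote above:
# each weight vector w is reparameterized as w = g * v / ||v||, decoupling
# its magnitude g from its direction v / ||v||.  Illustrative NumPy only.
import numpy as np

class WeightNormLinear:
    def __init__(self, in_features, out_features, rng):
        self.v = rng.standard_normal((out_features, in_features))  # direction parameters
        self.g = np.linalg.norm(self.v, axis=1)                    # magnitude parameters

    def weight(self):
        # Rows of v are rescaled to unit length, then multiplied by g.
        return self.g[:, None] * self.v / np.linalg.norm(self.v, axis=1, keepdims=True)

    def __call__(self, x):
        return x @ self.weight().T

rng = np.random.default_rng(0)
layer = WeightNormLinear(4, 3, rng)
print(layer(rng.standard_normal((2, 4))).shape)   # (2, 3)
```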
“…Implicit regularization methods reduce the generalization error of the model without limiting its representation ability; examples include batch normalization [19] and weight normalization [20]. They accelerate the convergence of the model and improve its generalization ability by smoothing the variance between data or by normalizing the parameter vectors to unit length, decoupling the network's weight matrices into magnitudes and directions. Thus, the model's invariance to translation and rotation of the input data is enhanced.…”
Section: Regularization (mentioning; confidence: 99%)
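
Batch normalization, the other implicit regularizer these quotes mention, smooths variance across a mini-batch by standardizing each feature before a learnable rescaling. A minimal training-time NumPy sketch follows, with the inference-time running statistics omitted for brevity; shapes and values are illustrative.

```python
# Minimal training-time batch normalization forward pass: each feature is
# standardized over the mini-batch, then rescaled by learnable gamma/beta.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                  # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))            # batch of 8 samples, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```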
“…In the more interesting overparametrized square loss case, generalization depends on solving a sort of regularized ERM that consists of finding minimizers of the empirical risk with zero loss, and then selecting the one with lowest complexity. Recent work [38] has provided theoretical and empirical evidence that this can be accomplished by SGD provided that the following conditions are satisfied:…”
Section: Optimization and Open Questions (mentioning; confidence: 99%)
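
The "regularized ERM" picture in this quote, choosing, among all zero-loss interpolating solutions, the one of lowest complexity, has a closed form in the linear special case: the minimum-norm interpolant given by the Moore-Penrose pseudoinverse. The sketch below uses that linear simplification purely as an illustration; it is not the deep-network construction analyzed in [38] or in the paper.

```python
# Linear illustration of "zero empirical loss + lowest complexity":
# in an overparameterized linear model (more parameters than samples) the
# minimum-ell2-norm interpolating solution is given by the pseudoinverse.
# Simplified stand-in for the deep-network setting, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 100             # overparameterized: d > n
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)

w_min_norm = np.linalg.pinv(X) @ y          # minimum-norm interpolant

print("train loss:", float(np.mean((X @ w_min_norm - y) ** 2)))   # ~0
print("||w||_2  :", float(np.linalg.norm(w_min_norm)))
```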
“…Higher complexity generally gives higher (upper or lower bounds on) generalization error. Theorem 4.1 [8,37] therefore implies that sparsity of a network dramatically reduces generalization error.…”
Mentioning; confidence: 99%