2018
DOI: 10.48550/arxiv.1809.08848

Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function

Abstract: We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do that by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the large network width and depth limit. The resulting singular value spectrum depends on a single parameter, which we calculate for a variety of popular activation functions, by analyzing the sig…
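As a rough numerical illustration of the object described in the abstract, the sketch below estimates the singular value spectrum of the input-output Jacobian of a randomly initialized ResNet. This is an editor-provided toy, not the paper's derivation: the residual block x -> x + W*phi(x), the ReLU activation, and the weight scale sigma are illustrative assumptions. With a small weight scale the singular values stay of order one rather than exploding or vanishing with depth.

```python
# Hedged sketch (not the paper's derivation): numerically estimate the singular
# value spectrum of the input-output Jacobian of a random ResNet at initialization.
# The block x -> x + W * phi(x), the ReLU activation, and the weight scale sigma
# are illustrative assumptions, not the exact architecture analyzed in the paper.
import numpy as np

def resnet_jacobian_singular_values(depth=100, width=200, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    phi = lambda h: np.maximum(h, 0)            # ReLU
    dphi = lambda h: (h > 0).astype(float)      # ReLU derivative
    x = rng.standard_normal(width)
    J = np.eye(width)                           # accumulated input-output Jacobian
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma / np.sqrt(width)
        D = np.diag(dphi(x))                    # activation derivative at this block's input
        J = (np.eye(width) + W @ D) @ J         # chain rule through x -> x + W * phi(x)
        x = x + W @ phi(x)                      # forward pass to the next block
    return np.linalg.svd(J, compute_uv=False)

s = resnet_jacobian_singular_values()
print(f"min={s.min():.3f}  median={np.median(s):.3f}  max={s.max():.3f}")
```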

Cited by 7 publications (17 citation statements)
References 22 publications
“…stable. In a similar spirit, [14,15] apply random matrix theory and mean field theory to investigate the Jacobian in the limit of extremely deep and/or wide networks. These papers focus on the behavior of the Jacobian of (usually simplified) networks during training, whereas our objective is to understand the behavior of pre-trained networks as they are used in practice, and to this end we also primarily consider the Jacobians of the network components rather than the network as a whole.…”
Section: Related Work (mentioning)
confidence: 99%
“…A neural network achieves dynamical isometry as long as every singular value of its input-output Jacobian matrix remains close to 1, so that the norm of every error vector and the angles between error vectors are preserved. Using the tools of free probability and random matrix theory, Pennington et al. (2017) [15] investigate the spectral density of the input-output Jacobian of plain fully-connected serial networks with Gaussian/orthogonal weights and ReLU/hard-tanh activation functions; Tarnowski et al. (2018) [16] study the density of singular values of the input-output Jacobian matrix in ResNets and show that dynamical isometry can always be achieved regardless of the choice of activation function. However, their studies only cover ResNets whose main branch consists of Gaussian or scaled orthogonal linear transforms and activation functions, and they do not provide a theoretical explanation of batch normalization.…”
Section: Related Work, 2.1 Theorems of Well-Behaved Neural Networks (mentioning)
confidence: 99%
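The statement above defines dynamical isometry through the singular values of the input-output Jacobian. As a minimal, self-contained illustration of why that condition matters, the sketch below (an editor-provided toy, not taken from any of the cited papers) compares a Jacobian whose singular values are all exactly 1 with one whose singular values are widely spread, and checks how each transforms the norm of, and the angle between, two random error vectors.

```python
# Toy check of the isometry property cited above: a Jacobian with all singular
# values equal to 1 preserves norms and angles of error vectors; a Jacobian with
# widely spread singular values distorts both. Sizes and spreads are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 256
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))

J_iso = Q1 @ Q2                              # all singular values exactly 1 (perfect isometry)
spread = np.exp(rng.uniform(-3.0, 3.0, n))   # singular values spread over roughly [0.05, 20]
J_bad = Q1 @ np.diag(spread) @ Q2            # same size, but far from isometric

u, v = rng.standard_normal(n), rng.standard_normal(n)

def angle(a, b):
    """Angle between two vectors, in degrees."""
    c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

for name, J in [("isometric", J_iso), ("spread", J_bad)]:
    print(f"{name:9s}  |Ju|/|u| = {np.linalg.norm(J @ u) / np.linalg.norm(u):7.3f}   "
          f"angle(u,v) = {angle(u, v):5.1f} deg -> angle(Ju,Jv) = {angle(J @ u, J @ v):5.1f} deg")
```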
“…If the network sits right at the border between the ordered and chaotic phases, it will be trainable even at a depth of 10,000 [4]. Pennington et al. (2017) [15], Tarnowski et al. (2018) [16], [17] and Ling & Qiu (2018) [18] argue that networks achieving dynamical isometry (all singular values of the network's input-output Jacobian matrix remain close to 1) do not suffer from gradient explosion or vanishing. Philipp et al. (2019) [19] directly evaluate the statistics of the gradient and propose a metric, the gradient scale coefficient (GSC), that can verify whether a network will suffer from gradient explosion.…”
Section: Introduction (mentioning)
confidence: 99%
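The gradient explosion/vanishing claim cited above can also be probed numerically. The sketch below is an illustrative toy (architectures and weight scales are editor-chosen, not taken from the cited papers): it backpropagates a random error vector through a plain tanh network and through a residual variant and reports how its norm evolves with depth. With these particular scales the plain network's gradient norm collapses while the residual network's stays of order one.

```python
# Toy comparison (editor-chosen setup): gradient norm under backpropagation
# through a plain tanh network vs. a residual variant.
import numpy as np

rng = np.random.default_rng(1)
width, depth = 128, 60

def backprop_gradient_norms(residual, sigma):
    """Each layer is x -> residual*x + tanh(W x); its Jacobian is residual*I + D W,
    where D = diag(1 - tanh(W x)^2). Returns the norm of a backpropagated error
    vector after each layer."""
    x = rng.standard_normal(width)
    layer_jacobians = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma / np.sqrt(width)
        h = W @ x
        D = np.diag(1.0 - np.tanh(h) ** 2)              # tanh derivative at the pre-activation
        layer_jacobians.append(residual * np.eye(width) + D @ W)
        x = residual * x + np.tanh(h)                   # forward pass to the next layer
    g = rng.standard_normal(width)
    norms = []
    for J in reversed(layer_jacobians):                 # backpropagate the error vector
        g = J.T @ g
        norms.append(np.linalg.norm(g))
    return norms

for name, residual, sigma in [("plain tanh", 0.0, 0.8), ("residual", 1.0, 0.1)]:
    n = backprop_gradient_norms(residual, sigma)
    print(f"{name:10s}  |g| after 1/{depth//2}/{depth} layers: "
          f"{n[0]:.2e} / {n[depth//2 - 1]:.2e} / {n[-1]:.2e}")
```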
“…Glorot and Bengio [7], He et al. [11], Saxe et al. [25], Poole et al. [23], Pennington et al. [21,22], to name some of the most prominent ones). Similarly, previous works have studied initialization strategies for un-normalized ResNets [10,31,32], but they lack large-scale experiments demonstrating the effectiveness of the proposed approaches and consider a simplified ResNet setup in which shortcut connections are ignored, even though these connections play an important role [15]. Zhang et al. [37] propose an initialization scheme for un-normalized ResNets that initializes the different types of layers individually using carefully designed schemes.…”
Section: Background and Existing Work (mentioning)
confidence: 99%