2018
DOI: 10.48550/arxiv.1809.08848

Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function

Abstract: We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do that by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the large network width and depth limit. The resulting singular value spectrum depends on a single parameter, which we calculate for a variety of popular activation functions, by analyzing the sig…
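As a rough numerical illustration of the object described in the abstract, the sketch below estimates the singular value spectrum of the input-output Jacobian of a randomly initialized ResNet. This is an editor-provided toy, not the paper's derivation: the residual block x -> x + W*phi(x), the ReLU activation, and the weight scale sigma are illustrative assumptions. With a small weight scale the singular values stay of order one rather than exploding or vanishing with depth.

```python
# Hedged sketch (not the paper's derivation): numerically estimate the singular
# value spectrum of the input-output Jacobian of a random ResNet at initialization.
# The block x -> x + W * phi(x), the ReLU activation, and the weight scale sigma
# are illustrative assumptions, not the exact architecture analyzed in the paper.
import numpy as np

def resnet_jacobian_singular_values(depth=100, width=200, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    phi = lambda h: np.maximum(h, 0)            # ReLU
    dphi = lambda h: (h > 0).astype(float)      # ReLU derivative
    x = rng.standard_normal(width)
    J = np.eye(width)                           # accumulated input-output Jacobian
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma / np.sqrt(width)
        D = np.diag(dphi(x))                    # activation derivative at this block's input
        J = (np.eye(width) + W @ D) @ J         # chain rule through x -> x + W * phi(x)
        x = x + W @ phi(x)                      # forward pass to the next block
    return np.linalg.svd(J, compute_uv=False)

s = resnet_jacobian_singular_values()
print(f"min={s.min():.3f}  median={np.median(s):.3f}  max={s.max():.3f}")
```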

Cited by 7 publications (17 citation statements)
References 22 publications
“…stable. In a similar spirit, [14,15] apply random matrix theory and mean field theory to investigate the Jacobian in the limit of extremely deep and/or wide networks. These papers focus on the behavior of the Jacobian of (usually simplified) networks during training, whereas our objective is to understand the behavior of pre-trained networks as they are used in practice, and to this end we also primarily consider the Jacobians of the network components rather than the network as a whole.…”
Section: Related Work (mentioning)
confidence: 99%
“…A neural network achieves dynamical isometry as long as every singular value of its input-output Jacobian matrix remains close to 1, so that the norm of every error vector and the angles between error vectors are preserved. Using the tools of free probability and random matrix theory, Pennington et al. (2017) [15] investigate the spectral density of the input-output Jacobian of plain fully-connected serial networks with Gaussian/orthogonal weights and ReLU/hard-tanh activation functions; Tarnowski et al. (2018) [16] study the density of singular values of the input-output Jacobian matrix in ResNets and show that dynamical isometry can always be achieved regardless of the choice of activation function. However, their studies only cover ResNets whose main branch consists of Gaussian or scaled orthogonal linear transforms and activation functions, and they do not provide a theoretical explanation of batch normalization.…”
Section: Related Work, 2.1 Theorems of Well-Behaved Neural Networks (mentioning)
confidence: 99%
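The statement above defines dynamical isometry through the singular values of the input-output Jacobian. As a minimal, self-contained illustration of why that condition matters, the sketch below (an editor-provided toy, not taken from any of the cited papers) compares a Jacobian whose singular values are all exactly 1 with one whose singular values are widely spread, and checks how each transforms the norm of, and the angle between, two random error vectors.

```python
# Toy check of the isometry property cited above: a Jacobian with all singular
# values equal to 1 preserves norms and angles of error vectors; a Jacobian with
# widely spread singular values distorts both. Sizes and spreads are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 256
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))

J_iso = Q1 @ Q2                              # all singular values exactly 1 (perfect isometry)
spread = np.exp(rng.uniform(-3.0, 3.0, n))   # singular values spread over roughly [0.05, 20]
J_bad = Q1 @ np.diag(spread) @ Q2            # same size, but far from isometric

u, v = rng.standard_normal(n), rng.standard_normal(n)

def angle(a, b):
    """Angle between two vectors, in degrees."""
    c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

for name, J in [("isometric", J_iso), ("spread", J_bad)]:
    print(f"{name:9s}  |Ju|/|u| = {np.linalg.norm(J @ u) / np.linalg.norm(u):7.3f}   "
          f"angle(u,v) = {angle(u, v):5.1f} deg -> angle(Ju,Jv) = {angle(J @ u, J @ v):5.1f} deg")
```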
“…If the network sits right at the border between the ordered and chaotic phases, it will be trainable even at a depth of 10,000 [4]. Pennington et al. (2017) [15], Tarnowski et al. (2018) [16], [17] and Ling & Qiu (2018) [18] argue that networks achieving dynamical isometry (all singular values of the network's input-output Jacobian matrix remain close to 1) do not suffer from gradient explosion or vanishing. Philipp et al. (2019) [19] directly evaluate the statistics of the gradient and propose a metric, the gradient scale coefficient (GSC), that can verify whether a network will suffer from gradient explosion.…”
Section: Introduction (mentioning)
confidence: 99%
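The gradient explosion/vanishing claim cited above can also be probed numerically. The sketch below is an illustrative toy (architectures and weight scales are editor-chosen, not taken from the cited papers): it backpropagates a random error vector through a plain tanh network and through a residual variant and reports how its norm evolves with depth. With these particular scales the plain network's gradient norm collapses while the residual network's stays of order one.

```python
# Toy comparison (editor-chosen setup): gradient norm under backpropagation
# through a plain tanh network vs. a residual variant.
import numpy as np

rng = np.random.default_rng(1)
width, depth = 128, 60

def backprop_gradient_norms(residual, sigma):
    """Each layer is x -> residual*x + tanh(W x); its Jacobian is residual*I + D W,
    where D = diag(1 - tanh(W x)^2). Returns the norm of a backpropagated error
    vector after each layer."""
    x = rng.standard_normal(width)
    layer_jacobians = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma / np.sqrt(width)
        h = W @ x
        D = np.diag(1.0 - np.tanh(h) ** 2)              # tanh derivative at the pre-activation
        layer_jacobians.append(residual * np.eye(width) + D @ W)
        x = residual * x + np.tanh(h)                   # forward pass to the next layer
    g = rng.standard_normal(width)
    norms = []
    for J in reversed(layer_jacobians):                 # backpropagate the error vector
        g = J.T @ g
        norms.append(np.linalg.norm(g))
    return norms

for name, residual, sigma in [("plain tanh", 0.0, 0.8), ("residual", 1.0, 0.1)]:
    n = backprop_gradient_norms(residual, sigma)
    print(f"{name:10s}  |g| after 1/{depth//2}/{depth} layers: "
          f"{n[0]:.2e} / {n[depth//2 - 1]:.2e} / {n[-1]:.2e}")
```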
“…Glorot and Bengio [7], He et al. [11], Saxe et al. [25], Poole et al. [23], Pennington et al. [21,22], to name some of the most prominent ones). Similarly, previous works have studied initialization strategies for un-normalized ResNets [10,31,32], but they lack large-scale experiments demonstrating the effectiveness of the proposed approaches and consider a simplified ResNet setup in which shortcut connections are ignored, even though these connections play an important role [15]. Zhang et al. [37] propose an initialization scheme for un-normalized ResNets that initializes the different types of layers individually using carefully designed schemes.…”
Section: Background and Existing Work (mentioning)
confidence: 99%