2021
DOI: 10.48550/arxiv.2102.10492
Preprint

Deep ReLU Networks Preserve Expected Length

Boris Hanin, Ryan Jeong, David Rolnick

Abstract: Assessing the complexity of functions computed by a neural network helps us understand how the network will learn and generalize. One natural measure of complexity is how the network distorts length: if the network takes a unit-length curve as input, what is the length of the resulting curve of outputs? It has been widely believed that this length grows exponentially in network depth. We prove that in fact this is not the case: the expected length distortion does not grow with depth, and indeed shrinks slightly.
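To make the quantity in the abstract concrete, here is a minimal Monte Carlo sketch (not the authors' code) that estimates the expected output-curve length for a He-initialized ReLU network applied to a unit-length input curve. The input dimension, width, depths, trial count, bias-free layers, and the choice of a straight segment as the curve are all illustrative assumptions.

```python
# Minimal sketch, assuming NumPy and a He-initialized, bias-free fully connected ReLU net.
import numpy as np

def random_relu_net(widths, rng):
    """Draw one He-initialized network: weight matrices for consecutive layer widths."""
    return [rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(layers, x):
    """Apply the network to a batch of points; ReLU on hidden layers, linear output."""
    for i, W in enumerate(layers):
        x = x @ W.T
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def output_curve_length(layers, curve_points):
    """Length of the image of the discretized curve, as a sum of segment lengths."""
    out = forward(layers, curve_points)
    return np.sum(np.linalg.norm(np.diff(out, axis=0), axis=1))

rng = np.random.default_rng(0)
d_in, width, n_trials = 16, 64, 200

# A straight unit-length segment in input space, finely discretized.
t = np.linspace(0.0, 1.0, 500)[:, None]
direction = rng.normal(size=(1, d_in))
direction /= np.linalg.norm(direction)
curve = t * direction  # total Euclidean length = 1

for depth in [1, 5, 10, 20]:
    widths = [d_in] + [width] * depth + [d_in]
    lengths = [output_curve_length(random_relu_net(widths, rng), curve)
               for _ in range(n_trials)]
    print(f"depth {depth:2d}: mean output length ~ {np.mean(lengths):.3f}")
```

If the paper's result holds in this setting, the printed means should stay roughly constant (and shrink slightly) as depth increases, rather than grow exponentially.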

Cited by 4 publications (11 citation statements)
References 7 publications
“…The conclusion in Theorem 2.2 is not new, having been obtained many times and under a variety of different assumptions (including for more general architectures) [31,49,56,68,82]. We refer the interested reader to [31] for a discussion of prior work and note only that convergence of the derivatives of the field z^{(L+1)}_α to its Gaussian limit does not seem to have been previously considered. We give a short proof that includes convergence of derivatives along the lines of the arguments in [31,49] in Appendix §A.…”
Section: Neural Network-centric Motivations (mentioning)
confidence: 91%
“…We refer the interested reader to [31] for a discussion of prior work and note only that convergence of the derivatives of the field z^{(L+1)}_α to its Gaussian limit does not seem to have been previously considered. We give a short proof that includes convergence of derivatives along the lines of the arguments in [31,49] in Appendix §A.…”
Section: Neural Network-centric Motivations (mentioning)
confidence: 99%
“…Based on these notions, Murray et al (2022) studies how to avoid rapid convergence of pairwise input correlations, vanishing and exploding gradients. However, Hanin et al (2021) proved that for a ReLU network with He initialization the length of the curve does not grow with the depth and even shrinks slightly. We establish similar results for maxout networks.…”
Section: Introduction (mentioning)
confidence: 99%
“…deep networks, how operations (linear transformations and non-linear activations) are connected and stacked together is vital, which is studied in network's convergence (Du et al, 2019;Zhou et al, 2020;Zou et al, 2020b), complexity (Poole et al, 2016;Rieck et al, 2018;Hanin et al, 2021), generalization (Chen et al, 2019b;Cao & Gu, 2019;Xiao et al, 2019), loss landscapes (Li et al, 2017;Fort & Jastrzebski, 2019;Shevchenko & Mondelli, 2020), etc.…”
Section: Introduction (mentioning)
confidence: 99%