2021
DOI: 10.48550/arxiv.2110.01765
Preprint

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

Abstract: Using an extended and formalized version of the Q/C map analysis of Poole et al. (2016), along with Neural Tangent Kernel theory, we identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data, and show how these can be avoided by carefully controlling the "shape" of the network's initialization-time kernel function. We then develop a method called Deep Kernel Shaping (DKS), which accomplishes this using a combination of precise parameter initializ…
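The Q/C map analysis referenced in the abstract tracks two scalar statistics of a wide network's initialization-time kernel across depth: q (the expected squared preactivation magnitude) and c (the correlation between preactivations for two different inputs). Below is a minimal Monte Carlo sketch of those maps in the spirit of Poole et al. (2016), which DKS extends. It is illustrative only, not the paper's code; the function names, sampling scheme, and default parameters are assumptions.

```python
import numpy as np

# Sketch of the Q/C map iteration for a wide fully-connected network at
# initialization (after Poole et al., 2016). Not the DKS implementation.

def q_map(q, phi, sigma_w2=1.0, sigma_b2=0.0, n_samples=100_000, seed=0):
    """One step of the variance map: q' = sw2 * E[phi(sqrt(q) z)^2] + sb2."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    return sigma_w2 * np.mean(phi(np.sqrt(q) * z) ** 2) + sigma_b2

def c_map(c, q, phi, sigma_w2=1.0, sigma_b2=0.0, n_samples=100_000, seed=0):
    """One step of the correlation map, using correlated Gaussian samples."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n_samples)
    z2 = c * z1 + np.sqrt(1.0 - c ** 2) * rng.standard_normal(n_samples)
    cov = sigma_w2 * np.mean(phi(np.sqrt(q) * z1) * phi(np.sqrt(q) * z2)) + sigma_b2
    return cov / q_map(q, phi, sigma_w2, sigma_b2, n_samples, seed)

# Iterating the maps shows how the kernel degenerates with depth: for tanh
# at this init, q collapses and c drifts toward 1 for all input pairs --
# the kind of pathology DKS is designed to avoid.
q, c = 1.0, 0.5
for layer in range(50):
    c = c_map(c, q, np.tanh)
    q = q_map(q, np.tanh)
print(f"after 50 layers: q ~ {q:.4f}, c ~ {c:.4f}")
```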

Cited by 6 publications (18 citation statements)
Citation types: 0 supporting, 18 mentioning, 0 contrasting
References 32 publications
“…In summary, based on our experiments on different initializations and activation functions and our findings on ResNet-18, NNPK seems to hint at the choices that are known to also work well for fully trained models (Gotmare et al., 2018; Shah et al., 2016; Martens et al., 2021).…”
Section: Effect of Activation Function (mentioning)
confidence: 57%
“…First, the NRF seems to improve when skip connections are used, for both combinations of initialization and activation function. However, the performance of the trained model using the adjustments proposed by Martens et al. (2021) seems to improve without the skip connections. This observation shows that there are cases in which NNPK may not reflect the performance of the final trained model (although the results may vary when using data augmentation and other types of regularization).…”
Section: Does Skip Connection Improve NRF? (mentioning)
confidence: 96%