2015
DOI: 10.48550/arxiv.1511.06422
Preprint

All you need is a good init

Abstract: Layer-sequential unit-variance (LSUV) initialization, a simple method for weight initialization in deep net learning, is proposed. The method consists of two steps. First, pre-initialize the weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of each layer's output to one. Experiments with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to…
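
The two steps described in the abstract map onto a short procedure. Below is a minimal PyTorch sketch of LSUV-style initialization; it is not the authors' reference implementation, and the hook-based layer scan and the tol / max_iters stopping rule are illustrative assumptions.

import torch
import torch.nn as nn

@torch.no_grad()
def lsuv_init(model, data_batch, tol=0.1, max_iters=10):
    # Step 1: pre-initialize every convolution / inner-product layer
    # with an orthonormal matrix (biases set to zero).
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.orthogonal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    # Step 2: proceed from the first to the final layer, rescaling the
    # weights until the layer's output variance on data_batch is close to 1.
    outputs = []
    def hook(mod, inp, out):
        outputs.append(out)

    for m in model.modules():
        if not isinstance(m, (nn.Conv2d, nn.Linear)):
            continue
        handle = m.register_forward_hook(hook)
        for _ in range(max_iters):
            outputs.clear()
            model(data_batch)            # forward pass; hook records this layer's output
            var = outputs[-1].var().item()
            if abs(var - 1.0) < tol:
                break
            m.weight.data /= var ** 0.5  # divide by the output std -> unit output variance
        handle.remove()
    return model

In this sketch the variance is estimated on a single representative mini-batch (data_batch), and layers are visited in registration order, which for sequential models corresponds to first-to-last.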

Cited by 99 publications (136 citation statements)
References 5 publications

“…(b) High training complexity: The transmitter needs to perform several tasks, such as symbol mapping, PS, and pre-distortion, jointly, and learning the transmitted waveform involves sequential input data, which significantly increases the NN size when "one-hot" encoding is applied, thereby increasing the training complexity. (c) Parameter initialization: It is difficult to know which parameter choice leads to good performance prior to training, and random parameter initialization can slow down or even completely stall the convergence process [48].…”
Section: A. Autoencoder Design
Citation type: mentioning, confidence: 99%
“…Performing hyperparameter optimization is computationally expensive, so we rely on empirical tests to guide the settings. We use an architecture of 8 hidden layers with 64 nodes each, applying the Gaussian Error Linear Unit (GELU) [34] activation and LSUV weight initialization [35]. Using the Adam [36] optimizer, we minimize either the mean squared error (MSE) loss…”
Section: Multilayer Perceptron (MLP)
Citation type: mentioning, confidence: 99%
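
For concreteness, a rough sketch of the setup this statement describes: an MLP with 8 hidden layers of 64 units, GELU activations, the Adam optimizer, and an MSE loss. The input/output dimensions, learning rate, and final linear head are illustrative assumptions, and the cited work's alternative loss option is omitted.

import torch
import torch.nn as nn

def build_mlp(in_dim, out_dim, hidden=64, depth=8):
    # depth hidden layers of `hidden` units, each followed by GELU.
    layers, dim = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(dim, hidden), nn.GELU()]
        dim = hidden
    layers.append(nn.Linear(dim, out_dim))   # linear output head (assumed)
    return nn.Sequential(*layers)

model = build_mlp(in_dim=16, out_dim=1)      # placeholder dimensions
# lsuv_init(model, example_batch)            # LSUV init as sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                       # mean squared error loss
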
“…Their design is specific to convolutions with certain non-linearities. Mishkin and Matas [12] and Krähenbühl et al. [9] have devised alternative inits for CNNs which initialize layer by layer such that the variance of the activations of each layer remains constant, e.g. close to one.…”
Section: Related Work
Citation type: mentioning, confidence: 99%