2021 · Preprint
DOI: 10.48550/arxiv.2105.10335
Data-driven Weight Initialization with Sylvester Solvers

Abstract: In this work, we propose a data-driven scheme to initialize the parameters of a deep neural network. This is in contrast to traditional approaches which randomly initialize parameters by sampling from transformed standard distributions. Such methods do not use the training data to produce a more informed initialization. Our method uses a sequential layer-wise approach where each layer is initialized using its input activations. The initialization is cast as an optimization problem where we minimize a combinati…
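The abstract is truncated here, so the exact objective is not visible, but the title points to Sylvester solvers. Below is a minimal sketch of how a per-layer least-squares fit with a two-sided quadratic penalty reduces to a Sylvester equation with a closed-form solution; the specific objective, the penalty operator R, and the placeholder targets Z are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.linalg import solve_sylvester


def init_layer_weights(X, Z, R, lam=1e-2):
    """Solve min_W ||W X - Z||_F^2 + lam ||R W||_F^2 in closed form.

    Setting the gradient to zero yields the Sylvester equation
        (lam R^T R) W + W (X X^T) = Z X^T,
    which scipy's solve_sylvester handles directly.

    X : (d_in, n)      input activations seen by this layer
    Z : (d_out, n)     target pre-activations (hypothetical: any
                       data-driven target heuristic can go here)
    R : (d_out, d_out) penalty operator; R = I is plain weight decay
    """
    A = lam * (R.T @ R)   # left coefficient of the Sylvester equation
    B = X @ X.T           # right coefficient: input Gram matrix
    Q = Z @ X.T           # right-hand side
    return solve_sylvester(A, B, Q)


# Sequential layer-wise pass, as the abstract describes: each layer is
# initialized from the activations produced by the layers before it.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 1024))               # raw inputs, 1024 samples
for d_out in (64, 64, 10):                    # toy 3-layer MLP
    Z = rng.normal(size=(d_out, X.shape[1]))  # placeholder targets
    W = init_layer_weights(X, Z, np.eye(d_out))
    X = np.maximum(W @ X, 0.0)                # ReLU, feed forward
```

This matches the sequential scheme in the abstract: the first layer is fit to the raw inputs, and each subsequent layer is fit to the activations the already-initialized layers produce.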

Cited by 2 publications (3 citation statements) · References 10 publications
“…Besides the works discussed in § 6, our work is also loosely related to other parameter prediction methods [113, 98, 114], analysis of the graph structure of neural networks [46], knowledge distillation from multiple teachers [115], compression methods [116], and optimization-based initialization [117-119]. Denil et al. [113] train a model that can predict a fraction of network parameters given the remaining parameters, but the model must be retrained for each new architecture.…”
Section: Appendix (mentioning)
confidence: 99%
“…The HyperGAN [114] can generate an ensemble of trained parameters in a computationally efficient way but, like the aforementioned works, is constrained to a particular architecture. Finally, MetaInit [117], GradInit [118], and Sylvester-based initialization [119] can initialize arbitrary networks by carefully optimizing their initial parameters, but the optimization loop generally makes them more computationally expensive than predicting parameters with GHNs. Overall, none of these prior works formulated the task of, or proposed methods for, predicting performant parameters for diverse, large-scale architectures as ours does.…”
Section: Appendix (mentioning)
confidence: 99%
“…Previous initialization methods are mostly handcrafted. They focus on finding proper variance patterns for randomly initialized weights [18, 16, 40, 36] or rely on empirical evidence derived from particular architectures [58, 23, 15, 7]. Recently, [60, 8] proposed learning-based initializations that learn to tune the norms of the initial weights so as to minimize a quantity closely tied to favorable training dynamics.…”
Section: Introduction (mentioning)
confidence: 99%
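For context on the "variance patterns" this quote refers to, here is a minimal sketch of fan-in variance scaling in the He et al. style; the exact convention (fan-in vs. fan-out, gain factor) varies across the cited works, so treat the constants as one common choice rather than a definitive recipe.

```python
import numpy as np

def he_init(fan_in, fan_out, rng=np.random.default_rng()):
    # Scale the weight variance so that, for ReLU networks, the
    # pre-activation variance is roughly preserved across layers
    # (He et al. convention: Var[w] = 2 / fan_in).
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))
```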