Abstract: In this work, we propose a data-driven scheme to initialize the parameters of a deep neural network. This contrasts with traditional approaches, which randomly initialize parameters by sampling from transformed standard distributions and thus do not use the training data to produce a more informed initialization. Our method uses a sequential layer-wise approach in which each layer is initialized using its input activations. The initialization is cast as an optimization problem where we minimize a combinati…
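The abstract above is truncated on this page, so the exact objective being minimized is not visible here. Still, the general pattern of sequential, activation-driven layer-wise initialization can be sketched. Below is a minimal, hypothetical PyTorch sketch in the spirit of LSUV-style schemes: each layer is rescaled so that its output activations on a real data batch have roughly unit variance, and the activations are then propagated forward as the next layer's input. The function name and the `target_std` knob are invented for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def layerwise_data_driven_init(model, data_batch, target_std=1.0, iters=5):
    """Illustrative layer-wise init (LSUV-style stand-in, not the paper's
    exact objective, which this page truncates): rescale each Linear/Conv
    layer so its output activations on real data have ~unit variance."""
    x = data_batch
    for layer in model.children():  # assumes a simple sequential model
        if isinstance(layer, (nn.Linear, nn.Conv2d)):
            for _ in range(iters):
                std = layer(x).std()
                if abs(std - target_std) < 1e-2:
                    break
                layer.weight.mul_(target_std / (std + 1e-8))
        x = layer(x)  # propagate activations to the next layer's input
    return model

# usage sketch: model = layerwise_data_driven_init(model, next(iter(train_loader))[0])
```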
“…Besides the works discussed in § 6, our work is also loosely related to other parameter prediction methods [113, 98, 114], analysis of the graph structure of neural networks [46], knowledge distillation from multiple teachers [115], compression methods [116], and optimization-based initialization [117–119]. Denil et al. [113] train a model that can predict a fraction of network parameters given the other parameters, but it must be retrained for each new architecture.…”
Section: Appendix
Mentioning confidence: 99%
“…The HyperGAN [114] can generate an ensemble of trained parameters in a computationally efficient way but, like the aforementioned works, is constrained to a particular architecture. Finally, MetaInit [117], GradInit [118], and Sylvester-based initialization [119] can initialize arbitrary networks by carefully optimizing their initial parameters, but due to the optimization loop they are generally more computationally expensive than predicting parameters with GHNs. Overall, these prior works neither formulated the task nor proposed methods for predicting performant parameters for diverse, large-scale architectures as ours does.…”
Deep learning has been successful in automating the design of features in machine learning pipelines. However, the algorithms that optimize neural network parameters remain largely hand-designed and computationally inefficient. We study whether deep learning can directly predict these parameters by exploiting past experience of training other networks. We introduce DEEPNETS-1M, a large-scale dataset of diverse computational graphs of neural architectures, and use it to explore parameter prediction on CIFAR-10 and ImageNet. By leveraging advances in graph neural networks, we propose a hypernetwork that can predict performant parameters in a single forward pass, taking a fraction of a second even on a CPU. The proposed model achieves surprisingly good performance on unseen and diverse networks. For example, it is able to predict all 24 million parameters of a ResNet-50, achieving 60% accuracy on CIFAR-10. On ImageNet, the top-5 accuracy of some of our networks approaches 50%. Our task, model, and results can potentially lead to a new, more computationally efficient paradigm of training networks. Our model also learns a strong representation of neural architectures, enabling their analysis.
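As a rough illustration of the mechanism this abstract describes (not the authors' actual GHN architecture), the toy sketch below embeds each node (operation) of a computational graph, runs a few rounds of message passing over the adjacency matrix, and decodes a flat parameter vector per node in a single forward pass. All class names, shapes, and the op vocabulary are invented for illustration; the real model handles heterogeneous weight shapes and far larger graphs.

```python
import torch
import torch.nn as nn

class TinyGHN(nn.Module):
    """Toy graph hypernetwork: node embeddings + message passing + a
    per-node decoder that emits a (flattened) weight tensor."""
    def __init__(self, num_op_types=16, hidden=64, max_weight_numel=9 * 64 * 64):
        super().__init__()
        self.embed = nn.Embedding(num_op_types, hidden)
        self.msg = nn.Linear(hidden, hidden)
        self.decode = nn.Linear(hidden, max_weight_numel)  # flat weights per node

    def forward(self, op_types, adj):
        # op_types: (N,) integer op ids; adj: (N, N) float adjacency matrix
        h = self.embed(op_types)
        for _ in range(3):  # a few rounds of mean-aggregation message passing
            h = torch.relu(h + adj @ self.msg(h) / adj.sum(1, keepdim=True).clamp(min=1))
        return self.decode(h)  # (N, max_weight_numel): predicted parameters

# one forward pass predicts parameters for the whole architecture graph:
# flat_w = TinyGHN()(op_types, adj); slices of flat_w are reshaped into each layer's shape
```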
“…Previous initialization methods are mostly handcrafted. They focus on finding proper variance patterns for randomly initialized weights [18, 16, 40, 36] or rely on empirical evidence derived from particular architectures [58, 23, 15, 7]. Recently, [60, 8] proposed learning-based initialization that tunes the norms of the initial weights so as to minimize a quantity intimately related to favorable training dynamics.…”
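The "variance patterns" this snippet refers to are schemes such as Xavier/He initialization, which set the variance of randomly drawn weights from the layer's fan-in (and/or fan-out) so activation statistics are preserved across layers. A minimal sketch of the He scheme for ReLU networks:

```python
import math
import torch

@torch.no_grad()
def he_normal_(weight: torch.Tensor):
    # He/Kaiming init for ReLU nets: Var(W) = 2 / fan_in, chosen so the
    # variance of activations stays roughly constant from layer to layer.
    fan_in = weight[0].numel()  # inputs feeding each output unit
    weight.normal_(0.0, math.sqrt(2.0 / fan_in))

w = torch.empty(64, 128)  # a Linear layer's weight, shape (out, in)
he_normal_(w)
print(w.std())            # ~ sqrt(2/128) ~ 0.125
```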
Automated machine learning has been widely explored to reduce the human effort involved in designing neural architectures and searching for suitable hyperparameters. In the domain of neural initialization, however, similar automated techniques have rarely been studied. Most existing initialization methods are handcrafted and highly dependent on specific architectures. In this paper, we propose a differentiable quantity, named GradCosine, with theoretical insights for evaluating the initial state of a neural network. Specifically, GradCosine is the cosine similarity of sample-wise gradients with respect to the initialized parameters. By analyzing the sample-wise optimization landscape, we show that both the training and test performance of a network can be improved by maximizing GradCosine under a gradient-norm constraint. Based on this observation, we further propose the Neural Initialization Optimization (NIO) algorithm. Generalized from the sample-wise analysis to the real batch setting, NIO automatically finds a better initialization at negligible cost compared with training time. With NIO, we improve the classification performance of a variety of neural architectures on CIFAR-10, CIFAR-100, and ImageNet. Moreover, we find that our method can even help train a large Vision Transformer architecture without warmup.
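GradCosine as described in this abstract is straightforward to evaluate naively: compute per-sample gradients at initialization and average their pairwise cosine similarities. The sketch below does exactly that, using a per-sample loop for clarity (the paper generalizes the quantity to the real batch setting); the function name is ours, and the NIO search loop that maximizes this under a gradient-norm constraint is omitted.

```python
import torch
import torch.nn.functional as F

def grad_cosine(model, x, y):
    """Average pairwise cosine similarity of per-sample gradients w.r.t.
    the current (initialized) parameters. Naive loop, for illustration."""
    grads = []
    for i in range(x.shape[0]):
        loss = F.cross_entropy(model(x[i:i + 1]), y[i:i + 1])
        g = torch.autograd.grad(loss, list(model.parameters()))
        grads.append(torch.cat([gi.flatten() for gi in g]))
    G = F.normalize(torch.stack(grads), dim=1)  # (B, P) unit-norm gradients
    sim = G @ G.T                               # pairwise cosine similarities
    B = sim.shape[0]
    return (sim.sum() - B) / (B * (B - 1))      # mean over i != j pairs

# NIO (per the abstract) searches for an initialization that maximizes this
# quantity under a gradient-norm constraint; here we only evaluate it.
```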