Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), 2019
DOI: 10.18653/v1/w19-4306

Efficient Language Modeling with Automatic Relevance Determination in Recurrent Neural Networks

Abstract: Reduction of the number of parameters is one of the most important goals in Deep Learning. In this article we propose an adaptation of Doubly Stochastic Variational Inference for Automatic Relevance Determination (DSVI-ARD) for neural network compression. We find this method to be especially useful in language modeling tasks, where the large number of parameters in the input and output layers is often excessive. We also show that DSVI-ARD can be applied together with encoder-decoder weight tying, allowing to achie…
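
For context, the encoder-decoder weight tying mentioned in the abstract refers to sharing the input embedding matrix with the output softmax projection of the language model. Below is a minimal, hypothetical PyTorch sketch of that idea (not the paper's implementation; the class name and layer sizes are illustrative assumptions):

```python
import torch.nn as nn


class TiedLSTMLM(nn.Module):
    """Illustrative LSTM language model with encoder-decoder weight tying:
    the input embedding and the output projection share one weight matrix."""

    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        self.decoder.weight = self.embed.weight  # tying: one matrix serves both layers

    def forward(self, tokens):            # tokens: (batch, seq_len) token ids
        hidden, _ = self.lstm(self.embed(tokens))
        return self.decoder(hidden)       # logits over the vocabulary
```

Tying works here because the embedding matrix and the decoder weight have the same shape (vocab_size × hidden_size), so the input and output vocabularies share a single set of parameters.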

Cited by 1 publication (2 citation statements) · References 16 publications
“…Earlier works have focused on inducing sparsity in standard feed-forward neural networks. Yet, Bayesian pruning methods have also been successfully applied to recurrent neural networks (RNNs) [Kodryan et al. 2019; Lobacheva et al. 2018]. Lobacheva et al. [2018] use Sparse VD to prune individual weights of an LSTM or follow the approach from Louizos et al. [2017] to sparsify neurons or gates and show results on text classification or language modeling problems.…”
Section: Variational Selection Schemes
confidence: 99%
“…Lobacheva et al. [2018] use Sparse VD to prune individual weights of an LSTM or follow the approach from Louizos et al. [2017] to sparsify neurons or gates and show results on text classification or language modeling problems. Kodryan et al. [2019] use instead the Automatic Relevance Determination (ARD) framework, in which a zero-mean element-wise factorized Gaussian prior distribution over the parameters is used, together with a corresponding Gaussian factorized posterior, such that a closed-form expression of the KL divergence term of the variational lower bound is obtained. Subsequently, the Doubly Stochastic Variational Inference (DSVI) method is used to maximize the variational lower bound, and the weights for which the prior variances are lower than a certain threshold are set to zero.…”
Section: Variational Selection Schemes
confidence: 99%
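
To make the mechanism described in the citation above concrete, here is a minimal, hypothetical PyTorch sketch of an ARD-style linear layer (not the authors' code; the class name, initialization, and pruning threshold are illustrative assumptions). Each weight has a Gaussian posterior N(mu, sigma^2) and a zero-mean Gaussian prior; setting the prior variance to its optimal value mu^2 + sigma^2 gives the closed-form KL term 0.5 * log(1 + mu^2 / sigma^2) per weight, and weights whose learned prior variance falls below a threshold are zeroed out:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ARDLinear(nn.Module):
    """Hypothetical sketch: linear layer with factorized Gaussian posterior
    q(w) = N(mu, sigma^2) and a zero-mean element-wise Gaussian (ARD) prior."""

    def __init__(self, in_features, out_features, threshold=1e-3):
        super().__init__()
        self.mu = nn.Parameter(0.02 * torch.randn(out_features, in_features))
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))
        self.threshold = threshold  # assumed pruning threshold on the prior variance

    def kl(self):
        # Closed-form KL(q || p) with the prior variance at its optimum
        # lambda* = mu^2 + sigma^2, i.e. 0.5 * log(1 + mu^2 / sigma^2) per weight.
        sigma2 = self.log_sigma2.exp()
        return 0.5 * torch.log1p(self.mu.pow(2) / sigma2).sum()

    def forward(self, x):
        if self.training:
            # Reparameterized weight sample for a stochastic estimate of the ELBO.
            w = self.mu + self.log_sigma2.mul(0.5).exp() * torch.randn_like(self.mu)
        else:
            # Prune: keep only weights whose learned prior variance exceeds the threshold.
            keep = (self.mu.pow(2) + self.log_sigma2.exp()) > self.threshold
            w = self.mu * keep
        return F.linear(x, w)
```

In training, the objective would be the usual language-modeling cross-entropy plus the summed KL terms of all such layers (the negative variational lower bound); the "doubly stochastic" part of DSVI refers to gradients being stochastic both over minibatches and over the Monte Carlo sampling of the weights.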