2020
DOI: 10.1609/aaai.v34i04.5938

Structured Sparsification of Gated Recurrent Neural Networks

Abstract: One of the most popular approaches to neural network compression is sparsification: learning sparse weight matrices. In structured sparsification, weights are set to zero in groups corresponding to structural units, e.g., neurons. We further develop the structured sparsification approach for gated recurrent neural networks, e.g., Long Short-Term Memory (LSTM). Specifically, in addition to the sparsification of individual weights and neurons, we propose sparsifying the preactivations of gates. This makes s…
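A minimal sketch of the kind of structured, gate-level pruning the abstract describes is given below. It uses a simple magnitude criterion on the stacked LSTM weight rows in PyTorch, whereas the paper derives sparsity from a Bayesian objective, so the `prune_gate_rows` helper and its threshold are illustrative assumptions rather than the authors' method.

```python
# A rough illustration of structured gate-level pruning for an LSTM.
# Assumptions (not from the paper): a magnitude-based group criterion and
# the threshold value; the paper obtains sparsity via Bayesian sparsification.
import torch
import torch.nn as nn

def prune_gate_rows(lstm: nn.LSTM, threshold: float = 0.5) -> None:
    """Zero the input-to-hidden and hidden-to-hidden weight rows of every
    gate preactivation whose combined L2 norm falls below `threshold`.
    A pruned gate's preactivation then depends only on its bias, i.e. the
    gate becomes constant, which simplifies the LSTM structure."""
    with torch.no_grad():
        w_ih = lstm.weight_ih_l0  # (4*hidden_size, input_size), gate order i, f, g, o
        w_hh = lstm.weight_hh_l0  # (4*hidden_size, hidden_size)
        # One group per (gate, hidden unit): the matching row in both matrices.
        group_norm = torch.sqrt(w_ih.pow(2).sum(dim=1) + w_hh.pow(2).sum(dim=1))
        mask = (group_norm >= threshold).to(w_ih.dtype).unsqueeze(1)
        w_ih.mul_(mask)
        w_hh.mul_(mask)

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
prune_gate_rows(lstm)
```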

Cited by 10 publications (8 citation statements)
References 14 publications (31 reference statements)
“…Earlier works have focused on inducing sparsity in standard feed-forward neural networks. Yet, Bayesian pruning methods have also been successfully applied to recurrent neural networks (RNNs) [Kodryan et al. 2019; Lobacheva et al. 2018]. Lobacheva et al. [2018] use Sparse VD to prune individual weights of an LSTM, or follow the approach from Louizos et al. [2017] to sparsify neurons or gates, and show results on text classification and language modeling problems.…”
Section: Variational Selection Schemes (mentioning)
confidence: 99%
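The Sparse VD approach referenced in the statement above prunes individual weights according to their learned dropout rates. Below is a minimal, self-contained sketch of that pruning rule; the tensor shapes, the `sparse_vd_mask` name, and the log-alpha threshold of 3 are assumptions for illustration (the threshold is a commonly used value, not one taken from the cited works).

```python
# A minimal sketch of the pruning rule in sparse variational dropout
# (Sparse VD), assuming per-weight posterior means `theta` and log-variances
# `log_sigma2` have already been learned; the log-alpha threshold of 3 is a
# commonly used value, not one taken from the works cited above.
import torch

def sparse_vd_mask(theta: torch.Tensor,
                   log_sigma2: torch.Tensor,
                   log_alpha_threshold: float = 3.0) -> torch.Tensor:
    """Keep weights with a low effective dropout rate alpha, where
    log(alpha) = log(sigma^2) - log(theta^2); a large alpha means the weight
    is dominated by noise and can be set to zero."""
    log_alpha = log_sigma2 - torch.log(theta.pow(2) + 1e-8)
    return (log_alpha < log_alpha_threshold).to(theta.dtype)

# Hypothetical usage on a single LSTM weight matrix:
theta = torch.randn(256, 128)             # learned posterior means
log_sigma2 = torch.randn(256, 128) - 4.0  # learned posterior log-variances
pruned = theta * sparse_vd_mask(theta, log_sigma2)
print(f"kept {pruned.ne(0).float().mean().item():.1%} of the weights")
```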
“…In future work, we intend to investigate other types of priors over the network parameters (e.g., sparse priors (Lobacheva et al., 2017)). We would also like to explicitly quantify the uncertainty captured in our framework under different sampling strategies or MCMC-SG methods (e.g., similar to McClure and Kriegeskorte (2016); Teye et al. (2018)).…”
Section: Discussion (mentioning)
confidence: 99%
“…One more drawback of neural networks is that they are slower than classic ML algorithms such as linear models or boosting, and require more memory to store parameters. But there are several techniques aimed at reducing the time and memory complexity of the trained models [13, 22–24].…”
Section: Neural Network (mentioning)
confidence: 99%