Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018)
DOI: 10.18653/v1/d18-1319

Bayesian Compression for Natural Language Processing

Abstract: In natural language processing, many tasks are successfully solved with recurrent neural networks, but such models have a huge number of parameters. The majority of these parameters are often concentrated in the embedding layer, whose size grows proportionally to the vocabulary length. We propose a Bayesian sparsification technique for RNNs which allows compressing the RNN dozens or hundreds of times without time-consuming hyperparameter tuning. We also generalize the model for vocabulary sparsification…
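The method builds on sparse variational dropout (referred to as the SparseVD method in the citing papers below): each weight gets a factorized Gaussian posterior with a learned per-weight dropout rate, and weights whose rate grows large are pruned after training. The snippet below is a minimal, hypothetical PyTorch-style sketch of such a layer applied to the embedding matrix; the class name SparseVDEmbedding, the initialization constants, and the pruning threshold are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseVDEmbedding(nn.Module):
    """Embedding layer with sparse variational dropout (illustrative sketch).

    Each weight w has a factorized Gaussian posterior q(w) = N(theta, sigma^2),
    equivalently a per-weight dropout rate alpha = sigma^2 / theta^2. Weights
    whose log alpha exceeds a threshold are treated as pruned at test time.
    """

    def __init__(self, vocab_size, emb_dim, threshold=3.0):
        super().__init__()
        self.theta = nn.Parameter(0.01 * torch.randn(vocab_size, emb_dim))
        self.log_sigma2 = nn.Parameter(torch.full((vocab_size, emb_dim), -10.0))
        self.threshold = threshold  # prune where log alpha > threshold

    @property
    def log_alpha(self):
        # log alpha = log sigma^2 - log theta^2
        return self.log_sigma2 - torch.log(self.theta ** 2 + 1e-8)

    def forward(self, token_ids):
        if self.training:
            # Reparameterization: sample weights from the Gaussian posterior.
            eps = torch.randn_like(self.theta)
            weight = self.theta + torch.exp(0.5 * self.log_sigma2) * eps
        else:
            # Deterministic test-time pass: zero out high-dropout-rate weights.
            mask = (self.log_alpha < self.threshold).to(self.theta.dtype)
            weight = self.theta * mask
        return F.embedding(token_ids, weight)

    def kl(self):
        # Approximate KL(q || log-uniform prior) from Molchanov et al. (2017).
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1
        return -neg_kl.sum()
```

During training, the returned KL term (scaled against the data log-likelihood, e.g. divided by the number of training batches) is added to the loss to form the variational objective; at test time only the unpruned weights remain. The paper's vocabulary sparsification works in the same spirit but, roughly speaking, attaches the sparsity-inducing factor per word rather than per weight, so entire embedding rows (and hence vocabulary entries) can be dropped.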


Cited by 11 publications (58 citation statements). References 9 publications.
“…This technique allows achieving better final performance of the model because such a train…” [the quotation is interrupted by a spilled results-table fragment comparing LR for Softmax (Grachev et al., 2019), TT for Softmax (Grachev et al., 2019), SparseVD (Chirkova et al., 2018), and DSVI-ARD (Ours)]
Section: Training and Evaluation
confidence: 85%
“…, perplexity and accuracy on the test set. The comparison of DSVI-ARD with other dense layers compression approaches revealed that our models can exhibit comparable perplexity quality while achieving much higher compression (in the Grachev et al. (2019) case) and even surpass models based on similar Bayesian compression techniques (in Chirkova et al. (2018)…”
Section: Training and Evaluation
confidence: 86%
“…In Table 2, the state-of-the-art perplexities for language modeling problem are assembled. In addition, we present in the last two rows of this table the best known results (for the PTB dataset) of compressed RNNs using SparseVD method [33]. Here the number of parameters for the compressed model from the paper [33] is computed in line with the remaining models as follows.…”
Section: Compression Results
confidence: 99%
“…The main advantage of the Bayesian sparsification techniques is that they have a small number of hyperparameters compared to pruning-based methods. As stated in (Chirkova et al., 2018), Bayesian compression also leads to a higher sparsity level (Neklyudov et al., 2017; Louizos et al., 2017). Our proposed VVD is inspired by these predecessors to specifically tackle the vocabulary redundancy problem in NLP tasks.…”
Section: Related Work
confidence: 96%