2017
DOI: 10.1007/978-3-319-69900-4_44

Neural Networks Compression for Language Modeling

Abstract: In this paper, we consider several compression techniques for the language modeling problem based on recurrent neural networks (RNNs). It is known that conventional RNNs, e.g., LSTM-based networks used in language modeling, are characterized by either high space complexity or substantial inference time. This problem is especially acute for mobile applications, where constant interaction with a remote server is impractical. Using the Penn Treebank (PTB) dataset, we compare pruning, quantization…
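
As context for the comparison described in the abstract, a minimal PyTorch sketch of a word-level LSTM language model of the kind such compression methods target; the vocabulary and layer sizes are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Word-level LSTM language model; sizes are illustrative, not taken from the paper."""
    def __init__(self, vocab_size=10000, embed_dim=650, hidden_dim=650, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) integer word ids
        emb = self.embed(tokens)
        out, state = self.lstm(emb, state)
        return self.decoder(out), state

model = LSTMLanguageModel()
n_params = sum(p.numel() for p in model.parameters())
print(f"uncompressed parameters: {n_params / 1e6:.1f}M")

Most of the parameters sit in the embedding, output softmax, and LSTM weight matrices, which is why pruning, quantization, and matrix decompositions all target those matrices.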


Cited by 23 publications (7 citation statements)
References 8 publications
“…However, the significantly large number of weights incurs massive computation and storage burden, hindering the deployment of the state-of-the-art deep learning methods on resource-constrained platforms, such as mobile phones and embedded devices. It has been extensively studied and shown that there exists inherent redundancy in these weights, and there have been increasing research efforts on removing this redundancy, which is known as weight pruning [13], [14], [29], [30].…”
Section: B. Network Optimization (mentioning)
confidence: 99%
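
To make the weight-pruning idea in the statement above concrete, a minimal numpy sketch of magnitude-based pruning of a single weight matrix; the target sparsity and matrix shape are illustrative assumptions, and pruning pipelines typically retrain afterwards to recover accuracy.

import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of entries with the smallest magnitude."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold          # keep only weights above the threshold
    return weights * mask

rng = np.random.default_rng(0)
W = rng.normal(size=(650, 2600))                # e.g. a stacked LSTM gate matrix (hypothetical shape)
W_pruned = magnitude_prune(W, sparsity=0.9)
print("nonzero fraction:", np.count_nonzero(W_pruned) / W.size)
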
“…[28], [31]. Low-rank matrix factorization [14], [29] is another way of pruning by decomposing the original weight matrix into the linear composition of a set of low-rank weight matrices. Even though these methods can achieve good compression ratio by constraining the rank to a small number, they also incur significant (>3%) accuracy loss.…”
Section: B. Network Optimization (mentioning)
confidence: 99%
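
As a sketch of the low-rank factorization mentioned in this statement, the snippet below replaces a dense matrix W with two factors A and B obtained from a truncated SVD; the rank and matrix size are illustrative, and practical methods usually fine-tune the factors afterwards.

import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate W (m x n) by A @ B with A (m x rank) and B (rank x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]      # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(650, 650))
A, B = low_rank_factorize(W, rank=64)
ratio = W.size / (A.size + B.size)
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"compression ratio: {ratio:.1f}x, relative error: {err:.3f}")

The accuracy caveat in the statement shows up here directly: pushing the rank down raises the compression ratio but also the approximation error.
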
“…Some of them were successfully applied to audio processing [17] and image processing [40]. However, they are not yet well-studied in the language modeling task [14].…”
Section: Pruning and Quantization (mentioning)
confidence: 99%
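
Since this statement groups pruning with quantization, a minimal numpy sketch of uniform post-training quantization of a weight matrix to 8-bit integers with a single per-matrix scale; this is a generic baseline, not the specific scheme evaluated in the paper.

import numpy as np

def uniform_quantize(W: np.ndarray, num_bits: int = 8):
    """Map W to signed integers with one scale per matrix (valid for num_bits <= 8)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(W)) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(650, 650)).astype(np.float32)
q, scale = uniform_quantize(W)
err = np.linalg.norm(W - dequantize(q, scale)) / np.linalg.norm(W)
print(f"int8 storage is 4x smaller than float32, relative error: {err:.4f}")
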
“…Similarly we can apply TT-decomposition to each matrix of LSTM layer (11)-(14) or the matrix of the output layer (4). Moreover, according to [41], the matrix-by-vector product and matrix sum can be efficiently implemented directly in the TT format without the need to convert these matrices to the TT.…”
Section: Tensor Train Decomposition (mentioning)
confidence: 99%
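
To illustrate the Tensor Train idea in this statement, a minimal numpy sketch of the TT-SVD construction: a weight matrix is reshaped into a higher-order tensor and decomposed into a chain of small cores. The reshaping, rank cap, and matrix size are illustrative assumptions, and the efficient TT matrix-by-vector products attributed to [41] in the statement are not implemented here.

import numpy as np

def tt_svd(tensor: np.ndarray, max_rank: int):
    """Decompose a d-way tensor into TT cores G_k of shape (r_{k-1}, n_k, r_k) via sequential truncated SVDs."""
    dims = tensor.shape
    cores, r_prev = [], 1
    unfolding = tensor.reshape(dims[0], -1)
    for k in range(len(dims) - 1):
        U, s, Vt = np.linalg.svd(unfolding, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        unfolding = (s[:r, None] * Vt[:r, :]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(unfolding.reshape(r_prev, dims[-1], 1))
    return cores

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))                          # hypothetical weight matrix
cores = tt_svd(W.reshape(8, 8, 8, 8, 8, 8), max_rank=16) # 512*512 elements reshaped to an 8^6 tensor
print("full parameters:", W.size, "TT parameters:", sum(c.size for c in cores))
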