2020
DOI: 10.1109/taslp.2020.3015659
Neural Network Language Model Compression With Product Quantization and Soft Binarization

Cited by 10 publications (11 citation statements)
References 24 publications
“…First, to the best of our knowledge, this paper is the first work to apply mixed precision quantization methods to Transformer language models. In contrast, previous research on low-bit quantization focused on convolutional neural networks (CNNs) [22] and LSTM-RNN LMs [23], where expert-designed, partially quantized linear layers containing binary weight matrices, full-precision biases, and additional scaling parameters were used to mitigate the performance degradation due to uniform precision quantization.…”
Section: Introduction
confidence: 99%
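The partially quantized layer described in this statement can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the cited papers' exact implementation: binary {-1, +1} weights with a full-precision scaling factor and a full-precision bias, trained via a straight-through estimator. The class name and the choice of scaling factor (mean absolute weight, XNOR-Net-style) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BinarizedLinear(nn.Module):
    """Illustrative partially quantized linear layer: binary weights,
    full-precision per-layer scale, full-precision bias."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Latent full-precision weights updated during training.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))  # kept full precision

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scaling factor alpha = mean |W| (one assumption among several
        # possible; the cited works use their own scaling schemes).
        alpha = self.weight.abs().mean()
        w_bin = torch.sign(self.weight)
        # Straight-through estimator: forward pass uses the scaled binary
        # weights, backward pass flows to the latent full-precision weights.
        w_q = self.weight + (alpha * w_bin - self.weight).detach()
        return nn.functional.linear(x, w_q, self.bias)

layer = BinarizedLinear(512, 512)
y = layer(torch.randn(8, 512))  # usable as a drop-in nn.Linear replacement
```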
“…Another powerful family of techniques recently drawing increasing interest across the machine learning, computer vision, and speech technology communities as a solution to this problem is low-bit DNN quantization [31]-[37], [52], [57], [58], [62], [74], [75]. By replacing floating-point DNN parameters with low-precision values, for example binary numbers, model sizes can be dramatically reduced without changing the DNN architecture [32], [57], [73].…”
Section: Introduction
confidence: 99%
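As a back-of-the-envelope check of the size reduction this passage describes, the sketch below (an illustration, not code from the cited works) packs the sign bits of a float32 weight matrix and compares storage: 1 bit per parameter versus 32, roughly a 32x reduction before any per-layer scaling factors are counted.

```python
import numpy as np

def packed_binary_size_bytes(weights: np.ndarray) -> int:
    """Bytes needed to store sign(weights) at 1 bit per parameter."""
    bits = (weights.ravel() >= 0).astype(np.uint8)
    return np.packbits(bits).nbytes

w = np.random.randn(1024, 1024).astype(np.float32)
full_size = w.nbytes                       # 4 bytes per parameter
binary_size = packed_binary_size_bytes(w)  # 1 bit per parameter
print(full_size / binary_size)             # ~32x smaller, same architecture
```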
“…Further DNN size reduction can be obtained when low-precision quantization is used in combination with neural architecture search (NAS) techniques, for example in the SqueezeNet system designed for computer vision tasks [52]. In contrast to the extensive prior research on low-bit quantization methods, which primarily targets computer vision tasks [31]-[37], [52], only limited previous research in this direction has been conducted in the context of language modelling [57], [58] and ASR systems [56], [59].…”
Section: Introduction
confidence: 99%
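To make the quantization-plus-search idea concrete, here is a toy sketch, not the SqueezeNet procedure and with hypothetical layer sizes: it exhaustively enumerates per-layer bit-widths and ranks the resulting model sizes, i.e. the kind of search space a NAS-style method would explore (a real method would also score accuracy, not just size).

```python
from itertools import product

# Hypothetical parameter counts per layer; purely illustrative.
layer_params = {"embedding": 10_000_000, "recurrent": 4_000_000, "output": 2_000_000}
candidate_bits = (1, 2, 4, 8)  # precisions the search may assign per layer

def size_megabytes(assignment: dict) -> float:
    """Total weight-storage cost for one per-layer bit-width assignment."""
    total_bits = sum(layer_params[name] * b for name, b in assignment.items())
    return total_bits / 8 / 1e6

# Enumerate every assignment and report the three smallest models.
assignments = [dict(zip(layer_params, bits))
               for bits in product(candidate_bits, repeat=len(layer_params))]
for a in sorted(assignments, key=size_megabytes)[:3]:
    print(f"{size_megabytes(a):8.2f} MB  {a}")
```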