2021
DOI: 10.48550/arxiv.2108.12074
Preprint

4-bit Quantization of LSTM-based Speech Recognition Models

Abstract: We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM-Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network Transducers (RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach applied to the LSTM portion of these models results in significant Word Error Rate (WER) degradation. On the other hand, we show that minimal…
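For context, the abstract refers to a naïve 4-bit integer quantization of LSTM weights and activations. The sketch below illustrates one such naïve scheme (symmetric, per-tensor); it is an assumption for illustration, not the authors' exact quantizer.

```python
import numpy as np

def quantize_symmetric_4bit(x: np.ndarray):
    """Naive symmetric per-tensor 4-bit quantization (illustrative sketch only)."""
    qmin, qmax = -8, 7                      # signed 4-bit integer range
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), qmin, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate float values from the 4-bit codes."""
    return q.astype(np.float32) * scale

# Example: quantize a random LSTM-sized weight matrix and check the error.
w = np.random.randn(256, 1024).astype(np.float32)
q, s = quantize_symmetric_4bit(w)
w_hat = dequantize(q, s)
print("mean abs quantization error:", float(np.abs(w - w_hat).mean()))
```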

Citations: cited by 5 publications (6 citation statements)
References: 21 publications (32 reference statements)
“…However, Mix-GEMM can be applied to all the DNNs quantized with any uniform affine quantization technique, and as such, any advancement in that area can be potentially leveraged by Mix-GEMM. For example, recent works [24], [65], [69] have demonstrated competitive quality of results for low mixed-precision quantization of BERT for Natural Language Processing (NLP), whose compute-expensive kernels based on matrix-matrix multiplications could be accelerated by exploiting Mix-GEMM.…”
Section: Experimental Evaluation (mentioning)
confidence: 99%
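The statement above relies on uniform affine quantization. As a rough, library-agnostic sketch (not tied to Mix-GEMM's implementation), the standard affine mapping with a scale and zero-point looks like this:

```python
import numpy as np

def affine_quantize(x: np.ndarray, num_bits: int = 4):
    """Uniform affine (asymmetric) quantization, shown only as an illustration.

    q     = clip(round(x / scale) + zero_point, qmin, qmax)
    x_hat = (q - zero_point) * scale
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0
    zero_point = int(np.clip(round(qmin - x_min / scale), qmin, qmax))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    x_hat = (q.astype(np.float32) - zero_point) * scale
    return q, scale, zero_point, x_hat
```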
“…A standard design approach to fit the ASR model under budget is to apply model quantization or network pruning to large models. Recent quantization studies [22,23,24,25] have shown that it is possible to quantize the ASR models to 4-bit and even 2-bit with only marginal performance loss. Similarly, in terms of network pruning, both unstructured and structured sparsity [17,18,19,20,21] patterns have seen reasonable performance at high sparsity levels, through various algorithms based on iterative magnitude pruning.…”
Section: Related Work (mentioning)
confidence: 99%
“…From prior studies, we have seen success on end-to-end ASR compression through sparse network pruning [17,18,19,20,21] and model quantization [22,23,24,25]. However, compressing these massive universal speech models can lead to new challenges on top of regular end-to-end models. For example, USMs have much larger model sizes, and therefore higher compression ratios are needed to reach the efficiency requirements for deployments.…”
Section: Introduction (mentioning)
confidence: 99%
“…Existing quantization methods can be post-training quantization (PTQ) or in-training / quantization-aware training (QAT). PTQ is applied after model training is complete by compressing models into 8-bit representations and is relatively well supported by various libraries [3,4,5,6,7,8], such as TensorFlow Lite [9] and AIMET [10] for on-device deployment. However, almost no existing PTQ supports customized quantization configurations to compress machine learning (ML) layers and kernels into sub-8-bit (S8B) regimes [11].…”
Section: Introduction (mentioning)
confidence: 99%
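The last statement contrasts PTQ with QAT. As a minimal, generic sketch (not TensorFlow Lite's or AIMET's actual API), PTQ amounts to calibrating quantization parameters on a small dataset after training is finished; the helper below is hypothetical and for illustration only.

```python
import numpy as np

def calibrate_ptq(activation_batches, num_bits: int = 8):
    """PTQ-style calibration sketch: derive a scale and zero-point from
    min/max statistics observed on a small calibration set, post-training."""
    qmax = 2 ** num_bits - 1
    a_min = min(float(a.min()) for a in activation_batches)
    a_max = max(float(a.max()) for a in activation_batches)
    scale = (a_max - a_min) / qmax
    if scale == 0.0:
        scale = 1.0
    zero_point = int(round(-a_min / scale))
    return scale, zero_point

# Hypothetical usage: a few batches of layer outputs collected after training.
calib_batches = [np.random.randn(32, 128).astype(np.float32) for _ in range(8)]
scale, zp = calibrate_ptq(calib_batches)
print(f"scale={scale:.4f}, zero_point={zp}")
```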