2021
DOI: 10.48550/arxiv.2108.12074
Preprint

4-bit Quantization of LSTM-based Speech Recognition Models

Abstract: We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM-Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network Transducers (RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach applied to the LSTM portion of these models results in significant Word Error Rate (WER) degradation. On the other hand, we show that minimal…
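For context, the abstract refers to a naïve 4-bit integer quantization of LSTM weights and activations. The sketch below illustrates one such naïve scheme (symmetric, per-tensor); it is an assumption for illustration, not the authors' exact quantizer.

```python
import numpy as np

def quantize_symmetric_4bit(x: np.ndarray):
    """Naive symmetric per-tensor 4-bit quantization (illustrative sketch only)."""
    qmin, qmax = -8, 7                      # signed 4-bit integer range
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), qmin, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate float values from the 4-bit codes."""
    return q.astype(np.float32) * scale

# Example: quantize a random LSTM-sized weight matrix and check the error.
w = np.random.randn(256, 1024).astype(np.float32)
q, s = quantize_symmetric_4bit(w)
w_hat = dequantize(q, s)
print("mean abs quantization error:", float(np.abs(w - w_hat).mean()))
```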

Citations: cited by 5 publications (6 citation statements)
References: 21 publications (32 reference statements)
“…However, Mix-GEMM can be applied to all the DNNs quantized with any uniform affine quantization technique, and as such, any advancement in that area can be potentially leveraged by Mix-GEMM. For example, recent works [24], [65], [69] have demonstrated competitive quality of results for low mixed-precision quantization of BERT for Natural Language Processing (NLP), whose compute-expensive kernels based on matrix-matrix multiplications could be accelerated by exploiting Mix-GEMM.…”
Section: Experimental Evaluation (mentioning)
confidence: 99%
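The statement above relies on uniform affine quantization. As a rough, library-agnostic sketch (not tied to Mix-GEMM's implementation), the standard affine mapping with a scale and zero-point looks like this:

```python
import numpy as np

def affine_quantize(x: np.ndarray, num_bits: int = 4):
    """Uniform affine (asymmetric) quantization, shown only as an illustration.

    q     = clip(round(x / scale) + zero_point, qmin, qmax)
    x_hat = (q - zero_point) * scale
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0
    zero_point = int(np.clip(round(qmin - x_min / scale), qmin, qmax))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    x_hat = (q.astype(np.float32) - zero_point) * scale
    return q, scale, zero_point, x_hat
```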
“…A standard design approach to fit the ASR model under budget is to apply model quantization or network pruning to large models. Recent quantization studies [22,23,24,25] have shown that it is possible to quantize the ASR models to 4-bit and even 2-bit with only marginal performance loss. Similarly, in terms of network pruning, both unstructured and structured sparsity [17,18,19,20,21] patterns have seen reasonable performance at high sparsity levels, through various algorithms based on iterative magnitude pruning.…”
Section: Related Work (mentioning)
confidence: 99%
“…From prior studies, we have seen success on end-to-end ASR compression through sparse network pruning [17,18,19,20,21] and model quantization [22,23,24,25]. However, compressing these massive universal speech models can lead to new challenges on top of regular end-to-end models. For example, USMs have much larger model sizes, and therefore higher compression ratios are needed to reach the efficiency requirements for deployments.…”
Section: Introduction (mentioning)
confidence: 99%
“…Existing quantization methods can be post-training quantization (PTQ) or in-training / quantization-aware training (QAT). PTQ is applied after model training is complete by compressing models into 8-bit representations and is relatively well supported by various libraries [3,4,5,6,7,8], such as TensorFlow Lite [9] and AIMET [10] for on-device deployment. However, almost no existing PTQ supports customized quantization configurations to compress machine learning (ML) layers and kernels into sub-8-bit (S8B) regimes [11].…”
Section: Introduction (mentioning)
confidence: 99%
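The last statement contrasts PTQ with QAT. As a minimal, generic sketch (not TensorFlow Lite's or AIMET's actual API), PTQ amounts to calibrating quantization parameters on a small dataset after training is finished; the helper below is hypothetical and for illustration only.

```python
import numpy as np

def calibrate_ptq(activation_batches, num_bits: int = 8):
    """PTQ-style calibration sketch: derive a scale and zero-point from
    min/max statistics observed on a small calibration set, post-training."""
    qmax = 2 ** num_bits - 1
    a_min = min(float(a.min()) for a in activation_batches)
    a_max = max(float(a.max()) for a in activation_batches)
    scale = (a_max - a_min) / qmax
    if scale == 0.0:
        scale = 1.0
    zero_point = int(round(-a_min / scale))
    return scale, zero_point

# Hypothetical usage: a few batches of layer outputs collected after training.
calib_batches = [np.random.randn(32, 128).astype(np.float32) for _ in range(8)]
scale, zp = calibrate_ptq(calib_batches)
print(f"scale={scale:.4f}, zero_point={zp}")
```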