Interspeech 2022
DOI: 10.21437/interspeech.2022-10809

4-bit Conformer with Native Quantization Aware Training for Speech Recognition

Abstract: End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR, with a low-bit quantization and N:M structured spa…
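The title and abstract center on low-bit (4-bit) quantization-aware training. As a rough illustration of the idea rather than the paper's exact recipe, the sketch below applies symmetric per-channel 4-bit fake quantization to a weight matrix: weights are rounded to integers in [-7, 7] and immediately dequantized, which is the forward-pass operation QAT inserts so training can adapt to the rounding error. The function name, the per-channel axis, and the symmetric range are illustrative assumptions.

```python
import numpy as np

def fake_quantize_4bit(weights: np.ndarray, axis: int = 0) -> np.ndarray:
    """Symmetric per-channel 4-bit fake quantization (illustrative sketch).

    Quantizes to signed integers in [-7, 7] and immediately dequantizes,
    which is the forward-pass behaviour quantization-aware training uses
    so gradients see the rounding error of low-bit weights.
    """
    num_levels = 2 ** (4 - 1) - 1                       # 7 for signed 4-bit
    # Per-channel scale: max absolute value along every axis except `axis`.
    reduce_axes = tuple(i for i in range(weights.ndim) if i != axis)
    max_abs = np.max(np.abs(weights), axis=reduce_axes, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / num_levels
    # Quantize (round + clip), then dequantize back to float.
    q = np.clip(np.round(weights / scale), -num_levels, num_levels)
    return q * scale

# Example: quantization error on a random projection matrix.
w = np.random.randn(256, 144).astype(np.float32)
w_q = fake_quantize_4bit(w, axis=0)
print("mean abs error:", np.mean(np.abs(w - w_q)))
```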

Cited by 9 publications (2 citation statements) · References 28 publications
“…Model compression has commonly been achieved through a number of methods such as sparsity pruning [6,10,11], low-bit quantization [12,13,14], knowledge distillation [15,16], and low-rank matrix factorization [17,18]. These techniques can typically be applied regardless of the model architecture, which allows them to be generalized to different tasks.…”
Section: Related Work
confidence: 99%
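The citing work lists several architecture-agnostic compression techniques. As a concrete example of one of them, low-rank matrix factorization, the snippet below (a minimal sketch, not taken from either paper) replaces a dense weight matrix with the product of two thinner matrices obtained by truncated SVD, reducing parameters whenever the chosen rank is small; the function name and rank are illustrative.

```python
import numpy as np

def low_rank_factorize(weights: np.ndarray, rank: int):
    """Truncated-SVD factorization W ~= A @ B (illustrative sketch).

    A dense (m, n) layer is replaced by two layers of shapes (m, rank)
    and (rank, n), cutting parameters from m*n to rank*(m + n).
    """
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # (m, rank), singular values folded in
    b = vt[:rank, :]                  # (rank, n)
    return a, b

w = np.random.randn(512, 512).astype(np.float32)
a, b = low_rank_factorize(w, rank=64)
print("params:", w.size, "->", a.size + b.size)
print("relative error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))
```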
“…However, without structured sparsity [19], the resulting model requires irregular memory access and without hardware support, memory usage and computation become inefficient. Quantization is typically applied to reduce model weights from 32-bit floating point values down to 8-bit integer values, and is also applied to lower quantization levels (i.e., 1-bit, 2-bit, or 4-bit [5,14]) and even mixed-precision quantization [20]. However, computations on low-bit quantization level models are not available on typical real-world hardware.…”
Section: Related Work
confidence: 99%
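The statement contrasts unstructured pruning, which produces irregular memory access, with structured N:M sparsity, which keeps at most N non-zero weights in every group of M consecutive weights so hardware can exploit the regular pattern. Below is a minimal sketch of enforcing a 2:4 pattern by magnitude; the function name and the 2:4 choice are illustrative assumptions, not details from the paper.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m (sketch).

    The last dimension is split into contiguous groups of size m; within
    each group only the n largest |w| survive, giving the regular N:M
    pattern that sparsity-aware hardware can accelerate.
    """
    orig_shape = weights.shape
    assert orig_shape[-1] % m == 0, "last dim must be divisible by m"
    groups = weights.reshape(-1, m)
    # Rank weights inside each group by magnitude; zero out all but the top n.
    order = np.argsort(np.abs(groups), axis=1)
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, order[:, : m - n], False, axis=1)
    return (groups * mask).reshape(orig_shape)

w = np.random.randn(8, 16).astype(np.float32)
w_sparse = prune_n_m(w, n=2, m=4)
print("kept fraction:", np.count_nonzero(w_sparse) / w.size)  # -> 0.5
```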