Interspeech 2022
DOI: 10.21437/interspeech.2022-10809

4-bit Conformer with Native Quantization Aware Training for Speech Recognition

Abstract: End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR, with a low-bit quantization and N:M structured spa…
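The title and abstract center on low-bit (4-bit) quantization-aware training. As a rough illustration of the idea rather than the paper's exact recipe, the sketch below applies symmetric per-channel 4-bit fake quantization to a weight matrix: weights are rounded to integers in [-7, 7] and immediately dequantized, which is the forward-pass operation QAT inserts so training can adapt to the rounding error. The function name, the per-channel axis, and the symmetric range are illustrative assumptions.

```python
import numpy as np

def fake_quantize_4bit(weights: np.ndarray, axis: int = 0) -> np.ndarray:
    """Symmetric per-channel 4-bit fake quantization (illustrative sketch).

    Quantizes to signed integers in [-7, 7] and immediately dequantizes,
    which is the forward-pass behaviour quantization-aware training uses
    so gradients see the rounding error of low-bit weights.
    """
    num_levels = 2 ** (4 - 1) - 1                       # 7 for signed 4-bit
    # Per-channel scale: max absolute value along every axis except `axis`.
    reduce_axes = tuple(i for i in range(weights.ndim) if i != axis)
    max_abs = np.max(np.abs(weights), axis=reduce_axes, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / num_levels
    # Quantize (round + clip), then dequantize back to float.
    q = np.clip(np.round(weights / scale), -num_levels, num_levels)
    return q * scale

# Example: quantization error on a random projection matrix.
w = np.random.randn(256, 144).astype(np.float32)
w_q = fake_quantize_4bit(w, axis=0)
print("mean abs error:", np.mean(np.abs(w - w_q)))
```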

Cited by 9 publications (2 citation statements) · References 28 publications
“…Model compression has commonly been achieved through a number of methods such as sparsity pruning [6,10,11], low-bit quantization [12,13,14], knowledge distillation [15,16], and low-rank matrix factorization [17,18]. These techniques can typically be applied regardless of the model architecture, which allows them to be generalized to different tasks.…”
Section: Related Work
confidence: 99%
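The citing work lists several architecture-agnostic compression techniques. As a concrete example of one of them, low-rank matrix factorization, the snippet below (a minimal sketch, not taken from either paper) replaces a dense weight matrix with the product of two thinner matrices obtained by truncated SVD, reducing parameters whenever the chosen rank is small; the function name and rank are illustrative.

```python
import numpy as np

def low_rank_factorize(weights: np.ndarray, rank: int):
    """Truncated-SVD factorization W ~= A @ B (illustrative sketch).

    A dense (m, n) layer is replaced by two layers of shapes (m, rank)
    and (rank, n), cutting parameters from m*n to rank*(m + n).
    """
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # (m, rank), singular values folded in
    b = vt[:rank, :]                  # (rank, n)
    return a, b

w = np.random.randn(512, 512).astype(np.float32)
a, b = low_rank_factorize(w, rank=64)
print("params:", w.size, "->", a.size + b.size)
print("relative error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))
```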
“…However, without structured sparsity [19], the resulting model requires irregular memory access and without hardware support, memory usage and computation become inefficient. Quantization is typically applied to reduce model weights from 32-bit floating point values down to 8-bit integer values, and is also applied to lower quantization levels (i.e., 1-bit, 2-bit, or 4-bit [5,14]) and even mixed-precision quantization [20]. However, computations on low-bit quantization level models are not available on typical real-world hardware.…”
Section: Related Work
confidence: 99%
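The statement contrasts unstructured pruning, which produces irregular memory access, with structured N:M sparsity, which keeps at most N non-zero weights in every group of M consecutive weights so hardware can exploit the regular pattern. Below is a minimal sketch of enforcing a 2:4 pattern by magnitude; the function name and the 2:4 choice are illustrative assumptions, not details from the paper.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m (sketch).

    The last dimension is split into contiguous groups of size m; within
    each group only the n largest |w| survive, giving the regular N:M
    pattern that sparsity-aware hardware can accelerate.
    """
    orig_shape = weights.shape
    assert orig_shape[-1] % m == 0, "last dim must be divisible by m"
    groups = weights.reshape(-1, m)
    # Rank weights inside each group by magnitude; zero out all but the top n.
    order = np.argsort(np.abs(groups), axis=1)
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, order[:, : m - n], False, axis=1)
    return (groups * mask).reshape(orig_shape)

w = np.random.randn(8, 16).astype(np.float32)
w_sparse = prune_n_m(w, n=2, m=4)
print("kept fraction:", np.count_nonzero(w_sparse) / w.size)  # -> 0.5
```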