Interspeech 2022
DOI: 10.21437/interspeech.2022-874
Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition

Cited by 7 publications (2 citation statements) · References 0 publications
“…Consequently, R(W) cannot enforce each weight to be replaced by the closest centroid in z as w = argmin_i ‖w − z_i‖, for w ∈ W and z_i ∈ z, because the min operator is not differentiable. Recent BP-QAT methods instead force weights to approach the centroids in z using R(W) = Σ_{w∈W} D(w, z), where the differentiable dissimilarity function D is based on a cosine function in [1,14].…”
Section: Related QAT Approaches
confidence: 99%
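A minimal sketch of this contrast, assuming PyTorch. The uniform centroid grid z, the hard_assign helper, and the cosine-shaped penalty standing in for the dissimilarity D are illustrative assumptions, not the exact formulation of [1,14]; the point is only that the argmin assignment blocks gradients while the smooth penalty summed over all weights does not.

```python
import math
import torch

def hard_assign(w, z):
    # Non-differentiable assignment: snap each weight to its nearest
    # centroid via argmin; no gradient flows back to w through this.
    idx = torch.argmin(torch.abs(w.unsqueeze(-1) - z), dim=-1)
    return z[idx]

def cosine_dissimilarity(w, z):
    # Illustrative differentiable D(w, z): a cosine-shaped penalty that is
    # zero at every centroid and peaks midway between neighbouring
    # centroids (assumes a uniformly spaced grid z).
    step = z[1] - z[0]
    return 0.5 * (1.0 - torch.cos(2 * math.pi * (w - z[0]) / step))

def regularizer(weights, z):
    # R(W) = sum over all weights of D(w, z); added to the task loss so
    # that gradients gently pull weights toward the centroids.
    return cosine_dissimilarity(weights, z).sum()

# Example: 4 uniformly spaced centroids (e.g., a 2-bit grid).
z = torch.linspace(-1.0, 1.0, steps=4)
W = torch.randn(8, requires_grad=True)

loss = regularizer(W, z)
loss.backward()                    # the soft penalty is differentiable
print(W.grad)                      # non-zero: weights get nudged to centroids
print(hard_assign(W.detach(), z))  # hard snap, used only at compression time
```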
“…FP-QAT [11] quantizes the model weights to pre-defined quantization centroids during forward propagation. BP-QAT [1,14,15] relies on customized regularizers to gradually force weights toward those quantization centroids (i.e., "soft quantization" via gradients) during training, before hard compression is applied in the late training phase. Because the customized regularizers push the model weights, at each training step, closer to the values they will be quantized to at runtime, predictive performance is often well preserved.…”
Section: Introduction
confidence: 99%
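To make the FP-QAT side of this contrast concrete, here is a minimal sketch, again assuming PyTorch. The straight-through estimator and the 8-centroid (3-bit) grid are illustrative assumptions rather than the exact scheme of [11]: the forward pass always computes with weights snapped to the pre-defined centroids, while the backward pass lets gradients reach the underlying float weights.

```python
import torch

class STEQuantize(torch.autograd.Function):
    # Forward-pass quantization to pre-defined centroids with a
    # straight-through estimator (STE), one common way to realize FP-QAT.

    @staticmethod
    def forward(ctx, w, z):
        # Snap each weight to its nearest centroid in z.
        idx = torch.argmin(torch.abs(w.unsqueeze(-1) - z), dim=-1)
        return z[idx]

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: the gradient w.r.t. w passes unchanged;
        # the fixed centroid grid z receives no gradient.
        return grad_output, None

# Pre-defined centroids, e.g. a symmetric 3-bit grid (an assumption).
z = torch.linspace(-1.0, 1.0, steps=8)
w = torch.randn(16, 4, requires_grad=True)
x = torch.randn(2, 16)

w_q = STEQuantize.apply(w, z)   # the layer computes with quantized weights
out = x @ w_q
out.sum().backward()            # STE lets gradients reach the float weights
print(w.grad.shape)             # torch.Size([16, 4])
```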