ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747552

Integer-Only Zero-Shot Quantization for Efficient Speech Recognition

Cited by 25 publications (25 citation statements) | References 16 publications

“…Lastly, a common application is to finetune (similar to training) BERT models to particular datasets. This not only decreases the model footprint and increases inference speed but adjusts the model to new data [2,31,73,53,74].…”
Section: Small Scale Low-precision Training
Mentioning confidence: 99%
“…It creates extreme asymmetric and unbalanced distributions by converting to the exponent. Therefore, many methods are devoted to designing specific quantizers for the quantization of Softmax output to maximize the information, such as Segmental quantizers [12,47], Logarithmic quantizers [22,28] or apply sparsification before quantization [20]. As shown in Figure 5, the Logarithmic quantizer has the largest 3.82 mutual information.…”
Section: Matthew-effect Preserving Quantization
Mentioning confidence: 99%
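
The logarithmic quantizer mentioned in the quote above maps each softmax probability to a power-of-two level, which suits the heavily skewed distribution of softmax outputs. As a rough illustration only (the 4-bit width, function names, and clipping policy are assumptions of this sketch, not details taken from the cited works), a minimal NumPy version could look like:

```python
import numpy as np

def log2_quantize(probs, num_bits=4):
    # Map each probability p to an integer exponent k with p ≈ 2**(-k).
    # The 4-bit default is an illustrative choice, not a value from the cited papers.
    max_exp = 2 ** num_bits - 1                   # largest representable exponent
    eps = 2.0 ** (-max_exp)                       # floor to avoid log2(0)
    k = np.round(-np.log2(np.clip(probs, eps, 1.0)))
    return np.clip(k, 0, max_exp).astype(np.int32)

def log2_dequantize(codes):
    # Recover approximate probabilities from the stored exponents.
    return 2.0 ** (-codes.astype(np.float64))

logits = np.array([3.0, 1.0, 0.2, -1.5])
probs = np.exp(logits) / np.exp(logits).sum()
codes = log2_quantize(probs)
print(codes)                   # [0 3 4 7]: small exponents for large probabilities
print(log2_dequantize(codes))  # power-of-two approximations of probs
```

Because the levels are spaced logarithmically, small probabilities retain relative resolution that a uniform quantizer of the same bit-width would collapse, which is the intuition behind the mutual-information comparison in the quote.
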
“…Existing quantization methods can be post-training quantization (PTQ) or in-training / quantization aware training (QAT). PTQ is applied after the model training is complete by compressing models into 8-bit representations and is relatively well supported by various libraries [3,4,5,6,7,8], such as TensorFlow Lite [9] and AIMET [10] for on-device deployment. However, almost no existing PTQ supports customized quantization configurations to compress machine learning (ML) layers and kernels into sub-8-bit (S8B) regimes [11].…”
Section: Introduction
Mentioning confidence: 99%
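
As a rough sketch of the 8-bit PTQ flow such libraries expose (the SavedModel path, input shape, and random calibration data below are placeholders, and this is not the configuration used in the paper under discussion), TensorFlow Lite's converter can compress a trained model into a full-integer 8-bit representation:

```python
import numpy as np
import tensorflow as tf

# Placeholder: path to an already-trained SavedModel (illustrative, not from the paper).
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

def representative_dataset():
    # Calibration samples let the converter estimate activation ranges; random
    # data stands in for real inputs, and (1, 16000) is a placeholder shape
    # (e.g. one second of 16 kHz audio).
    for _ in range(100):
        yield [np.random.rand(1, 16000).astype(np.float32)]

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict conversion to built-in int8 kernels so weights, activations,
# and the model I/O all end up as 8-bit integers.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

As the quote notes, interfaces like this generally stop at 8 bits; customized sub-8-bit (S8B) configurations for individual layers and kernels require dedicated quantization schemes.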