2021
DOI: 10.48550/arxiv.2101.01321
Preprint

I-BERT: Integer-only BERT Quantization

Abstract: Transformer-based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for many edge processors, and it has been a challenge to deploy these models for edge applications and devices that have resource constraints. While quantization can be a viable solution to this, previous work on quantizing Transformer-based models uses floating-point arithmetic during inference…

Cited by 7 publications (17 citation statements)
References 46 publications
“…Furthermore, calculating these statistics requires floating point operations that would prevent us from doing integer-only quantization. Therefore, in this work, we only use static quantization where we pre-compute the clipping ranges and fix them during inference as in [17,18,31]. It is straightforward to pre-compute the ranges for weights as they are fixed during inference.…”
Section: A. Basic Quantization Methods
confidence: 99%
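As a concrete illustration of the static quantization described in this statement, here is a minimal sketch assuming asymmetric 8-bit uniform quantization; the clipping values, function names, and calibration step are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def compute_scale_zero_point(clip_min, clip_max, num_bits=8):
    """Derive a uniform-quantization scale and zero point from a
    pre-computed (static) clipping range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (clip_max - clip_min) / (qmax - qmin)
    zero_point = int(round(qmin - clip_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    """Map float values into the fixed integer range; because the range
    is frozen after calibration, no statistics are computed at inference."""
    qmin, qmax = 0, 2 ** num_bits - 1
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

# Hypothetical clipping range, calibrated offline and then fixed.
scale, zp = compute_scale_zero_point(-2.5, 2.5)
x = np.array([-3.0, -1.2, 0.0, 0.7, 2.9])
print(quantize(x, scale, zp))  # out-of-range values saturate at 0 / 255
```

Because the range is frozen after calibration, inputs falling outside it simply saturate at the integer bounds; that is the trade-off static quantization accepts in order to avoid computing range statistics, and hence floating-point operations, at inference time.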
“…Integer-only quantization [17,18,31] not only represents the model weights and activations with low-precision integer values, but it also carries out the entire inference with integer arithmetic. Broadly speaking, the core of integer-only quantization is the linear property of the operations.…”
Section: B. Integer-only Quantization
confidence: 99%
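The linearity argument can be made concrete: if x ≈ s_x·q_x and W ≈ s_w·q_w, then x·W ≈ (s_x·s_w)·(q_x·q_w), so the matrix multiply itself runs entirely on integers, and the combined scale factor can be folded into a fixed-point multiply-and-shift. The sketch below assumes symmetric int8 quantization (zero points of zero) and a dyadic rescaling factor; the function names, scales, and shift width are hypothetical, not taken from the paper.

```python
import numpy as np

def int_only_matmul(q_x, q_w, s_x, s_w, s_y):
    """Integer-only linear layer: y = (s_x*q_x) @ (s_w*q_w)
    = (s_x*s_w) * (q_x @ q_w), so accumulation runs in int32 and only
    a fixed rescaling maps the accumulator to the output scale."""
    acc = q_x.astype(np.int32) @ q_w.astype(np.int32)  # integer accumulation
    # Fold s_x*s_w/s_y into an integer multiplier and a bit shift, so no
    # floating-point operation is needed at inference time.
    shift = 24
    multiplier = int(round(s_x * s_w / s_y * (1 << shift)))
    q_y = (acc.astype(np.int64) * multiplier) >> shift
    return np.clip(q_y, -128, 127).astype(np.int8)

# Hypothetical symmetric int8 tensors and scales.
rng = np.random.default_rng(0)
q_x = rng.integers(-128, 128, size=(2, 4), dtype=np.int8)
q_w = rng.integers(-128, 128, size=(4, 3), dtype=np.int8)
print(int_only_matmul(q_x, q_w, s_x=0.02, s_w=0.01, s_y=0.05))
```

The multiplier is computed once, offline, from the known scales; at inference only the integer matmul, one integer multiply, and one shift remain, which is what allows such kernels to run on integer-only hardware.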