“…To accelerate inference and reduce storage for large models without sacrificing performance, prior works propose compressing models with techniques including weight pruning [24], channel slimming [43,44], layer skipping [4,73], patterned or block pruning [17,35,40,42,49,50,51,52,56,57,82,84], and network quantization [12,18,30,31,32,38,75]. Specifically, these studies focus on compressing discriminative models for image classification, detection, or segmentation tasks.…”
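To make the first of these techniques concrete, the sketch below applies magnitude-based weight pruning to a single linear layer: weights whose absolute value falls below a data-dependent threshold are zeroed, trading a controlled accuracy loss for sparsity. This is a generic PyTorch illustration, not the method of any cited work; the layer sizes and the 50% sparsity target are arbitrary assumptions.

```python
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out roughly the `sparsity` fraction of smallest-magnitude weights."""
    with torch.no_grad():
        w = layer.weight
        k = int(sparsity * w.numel())
        if k == 0:
            return
        # Threshold = k-th smallest absolute weight value.
        threshold = w.abs().flatten().kthvalue(k).values
        # Keep weights strictly above the threshold; ties are also pruned,
        # so realized sparsity can slightly exceed the target.
        mask = w.abs() > threshold
        w.mul_(mask)

# Hypothetical usage: prune half the weights of a small layer.
layer = nn.Linear(128, 64)
magnitude_prune(layer, sparsity=0.5)
print(f"realized sparsity: {(layer.weight == 0).float().mean():.2f}")
```

In practice such pruning is typically followed by fine-tuning to recover accuracy, and structured variants (e.g., the channel slimming or block pruning cited above) zero entire rows, columns, or blocks so that standard hardware can actually exploit the sparsity.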