“…Post-Training Quantization. Post-training quantization [3,1,24,23] requires no training, only a subset of the dataset for calibrating the quantization parameters, including the clipping threshold and bias correction. Commonly, the quantization parameters of both weights and activations are decided before inference.…”
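For concreteness, a minimal sketch of such a calibration step is given below. It is illustrative only and not the procedure of the cited works [3,1,24,23]; the helper names (`calibrate_scale`, `fake_quantize`, `correct_bias`) are made up for this example.

```python
import numpy as np

def calibrate_scale(calib_values, num_bits=8, percentile=99.99):
    """Pick a clipping threshold from calibration data and derive a step size."""
    clip = np.percentile(np.abs(calib_values), percentile)  # clipping threshold
    qmax = 2 ** (num_bits - 1) - 1                          # symmetric signed range
    return clip / qmax

def fake_quantize(x, scale, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def correct_bias(bias, float_out, quant_out):
    # Bias correction: absorb the mean quantization error of the layer output.
    return bias + (float_out - quant_out).mean(axis=0)
```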
Learning convolutional neural networks (CNNs) with low bitwidth is challenging because performance may drop significantly after quantization. Prior work often discretizes the network weights by carefully tuning quantization hyper-parameters (e.g., non-uniform step size and layer-wise bitwidths), which is complicated and sub-optimal because the full-precision and low-precision models have a large discrepancy. This work presents a novel quantization pipeline, Frequency-Aware Transformation (FAT), which has several appealing benefits. (1) Rather than designing complicated quantizers as in existing works, FAT learns to transform network weights in the frequency domain before quantization, making them more amenable to training in low bitwidth. (2) With FAT, CNNs can be easily trained in low precision using simple standard quantizers without tedious hyper-parameter tuning. Theoretical analysis shows that FAT improves both uniform and non-uniform quantizers. (3) FAT can be easily plugged into many CNN architectures. When training ResNet-18 and MobileNet-V2 in 4 bits, FAT plus a simple rounding operation already achieves 70.5% and 69.2% top-1 accuracy on ImageNet without bells and whistles, outperforming the recent state of the art while reducing computation by 54.9× and 45.7× relative to the full-precision models. We hope FAT provides a novel perspective for model quantization. Code is available at https://github.com/ChaofanTao/FAT_Quantization.
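The gist of the described pipeline, a transform of the weights in the frequency domain followed by a plain uniform quantizer, can be sketched as follows. This is not the actual FAT implementation (see the linked repository); the fixed magnitude-thresholding mask below merely stands in for FAT's learned transform.

```python
import numpy as np

def frequency_transform(weight, keep_ratio=0.9):
    spec = np.fft.fft2(weight)                  # weights in the frequency domain
    mag = np.abs(spec)
    thresh = np.quantile(mag, 1.0 - keep_ratio)
    spec = np.where(mag >= thresh, spec, 0.0)   # suppress low-magnitude components
    return np.real(np.fft.ifft2(spec))          # back to the spatial domain

def uniform_quantize(weight, num_bits=4):
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(weight).max() / qmax
    return np.clip(np.round(weight / scale), -qmax - 1, qmax) * scale

w = np.random.randn(64, 64).astype(np.float32)
w_q = uniform_quantize(frequency_transform(w), num_bits=4)
```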
“…Post-training quantization aims to quantize neural networks using a small part of the dataset (in some cases, no data at all) to calibrate the quantization parameters so that a certain local criterion is satisfied (e.g., matching the minimum and maximum, or minimizing the MSE). Recent work [22] showed that minimizing the mean squared error (MSE) introduced in the pre-activations can be considered (under certain assumptions) the best possible local criterion and optimized the rounding policy based on it. Works [12] and [28] use the same local criterion but optimize weights and quantization parameters directly and employ per-channel weight quantization, thus considering a simplified task.…”
Section: Quantization Methods (mentioning, confidence 99%)
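As a rough illustration of the MSE-based local criterion quoted above (not the optimization procedures of [22], [12], or [28]), one can sweep candidate clipping thresholds on calibration data and keep the one with the smallest quantization MSE:

```python
import numpy as np

def mse_calibrate(x, num_bits=8, grid=100):
    """Sweep candidate clipping thresholds; return the scale with minimal MSE."""
    qmax = 2 ** (num_bits - 1) - 1
    best_scale, best_err = None, np.inf
    for frac in np.linspace(0.1, 1.0, grid):
        scale = frac * np.abs(x).max() / qmax
        xq = np.clip(np.round(x / scale), -qmax - 1, qmax) * scale
        err = np.mean((x - xq) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

acts = np.random.laplace(size=10000)   # stand-in for calibration pre-activations
scale = mse_calibrate(acts, num_bits=8)
```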
“…At the same time, it does not introduce significant computational overhead. Nevertheless, considerable research effort has been devoted to eliminating the need for per-channel quantization of weights in order to simplify the implementation of quantized operations [21,22]. In our work, we investigate the importance of per-channel quantization for GANs.…”
Section: Per-channel and Per-tensor Weight Quantization (mentioning, confidence 99%)
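The distinction between the two granularities can be illustrated with a short sketch (assumed weight layout `[out_channels, in_features]`; not code from the cited works):

```python
import numpy as np

def fake_quantize(w, scale, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

w = np.random.randn(128, 64)                      # [out_channels, in_features]

# Per-tensor: a single scale shared by the whole weight tensor.
w_per_tensor = fake_quantize(w, np.abs(w).max() / 127)

# Per-channel: one scale per output channel, broadcast along each row.
channel_scales = np.abs(w).max(axis=1, keepdims=True) / 127
w_per_channel = fake_quantize(w, channel_scales)
```

Per-channel scales track each filter's range, so the quantization error is typically lower, at the cost of a slightly more complex integer kernel.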
“…Another approach to post-training quantization of GANs is built upon recent works [17,5,22]. Consider a generator model…”
Generative adversarial networks (GANs) have an enormous potential impact on digital content creation, e.g., photo-realistic digital avatars, semantic content editing, and quality enhancement of speech and images. However, the performance of modern GANs comes together with massive amounts of computation performed during inference and high energy consumption. That complicates, or even makes impossible, their deployment on edge devices. The problem can be reduced with quantization, a neural network compression technique that facilitates hardware-friendly inference by replacing floating-point computations with low-bit integer ones. While quantization is well established for discriminative models, the performance of modern quantization techniques applied to GANs remains unclear. GANs generate content with a more complex structure than discriminative models, and thus quantization of GANs is significantly more challenging. To tackle this problem, we perform an extensive experimental study of state-of-the-art quantization techniques on three diverse GAN architectures, namely StyleGAN, Self-Attention GAN, and CycleGAN. As a result, we discovered practical recipes that allowed us to successfully quantize these models for inference with 4/8-bit weights and 8-bit activations while preserving the quality of the original full-precision models.
“…[5,45] even perform quantization without accessing any real data. [63,64] adopt intermediate feature-map reconstruction to optimize the rounding policy.…”
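A toy illustration of a rounding-policy search driven by feature-map reconstruction is sketched below. The cited works use gradient-based relaxations; this greedy per-weight flip is only meant to show the objective and would be far too slow in practice.

```python
import numpy as np

def optimize_rounding(w, x, scale, passes=2):
    """Choose floor vs. ceil per weight to minimize feature-map MSE.
    w: [out, in] float weights, x: [batch, in] calibration inputs."""
    lo = np.floor(w / scale) * scale
    hi = lo + scale
    wq = np.where(np.abs(w - lo) <= np.abs(w - hi), lo, hi)  # start from nearest
    target = x @ w.T                                          # float feature map

    def mse(m):
        return np.mean((target - x @ m.T) ** 2)

    for _ in range(passes):                                   # greedy coordinate passes
        for idx in np.ndindex(*w.shape):
            flipped = wq.copy()
            flipped[idx] = hi[idx] if wq[idx] == lo[idx] else lo[idx]
            if mse(flipped) < mse(wq):
                wq = flipped
    return wq

w, x = np.random.randn(8, 16), np.random.randn(32, 16)
w_q = optimize_rounding(w, x, scale=np.abs(w).max() / 7)      # 4-bit-like step size
```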
Model quantization has emerged as an indispensable technique to accelerate deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization work is often unreproducible and undeployable. This is because researchers do not choose consistent training pipelines and ignore the requirements of hardware deployment. In this work, we propose Model Quantization Benchmark (MQBench), a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms. We choose multiple different platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate extensive state-of-the-art quantization algorithms under a unified training pipeline. MQBench acts as a bridge connecting algorithms and hardware. We conduct a comprehensive analysis and find considerable intuitive and counter-intuitive insights. By aligning the training settings, we find that existing algorithms have about the same performance on the conventional academic track, while for hardware-deployable quantization there is a huge accuracy gap which remains unsettled. Surprisingly, no existing algorithm wins every challenge in MQBench, and we hope this work can inspire future research directions.