“…Post-Training Quantization. Post-training quantization [3,1,24,23] requires no training, only a subset of the dataset for calibrating the quantization parameters, including the clipping threshold and bias correction. Commonly, the quantization parameters of both weights and activations are decided before inference.…”
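For concreteness, a minimal sketch of such a calibration step is given below. It is illustrative only and not the procedure of the cited works [3,1,24,23]; the helper names (`calibrate_scale`, `fake_quantize`, `correct_bias`) are made up for this example.

```python
import numpy as np

def calibrate_scale(calib_values, num_bits=8, percentile=99.99):
    """Pick a clipping threshold from calibration data and derive a step size."""
    clip = np.percentile(np.abs(calib_values), percentile)  # clipping threshold
    qmax = 2 ** (num_bits - 1) - 1                          # symmetric signed range
    return clip / qmax

def fake_quantize(x, scale, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def correct_bias(bias, float_out, quant_out):
    # Bias correction: absorb the mean quantization error of the layer output.
    return bias + (float_out - quant_out).mean(axis=0)
```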
Learning convolutional neural networks (CNNs) with low bitwidth is challenging because performance may drop significantly after quantization. Prior work often discretizes the network weights by carefully tuning quantization hyper-parameters (e.g., non-uniform step size and layer-wise bitwidths), which is complicated and sub-optimal because the full-precision and low-precision models have a large discrepancy. This work presents a novel quantization pipeline, Frequency-Aware Transformation (FAT), which has several appealing benefits. (1) Rather than designing complicated quantizers as in existing works, FAT learns to transform network weights in the frequency domain before quantization, making them more amenable to training in low bitwidth. (2) With FAT, CNNs can be easily trained in low precision using simple standard quantizers without tedious hyper-parameter tuning. Theoretical analysis shows that FAT improves both uniform and non-uniform quantizers. (3) FAT can be easily plugged into many CNN architectures. When training ResNet-18 and MobileNet-V2 in 4 bits, FAT plus a simple rounding operation already achieves 70.5% and 69.2% top-1 accuracy on ImageNet without bells and whistles, outperforming the recent state of the art while reducing computation by 54.9× and 45.7× relative to the full-precision models. We hope FAT provides a novel perspective for model quantization. Code is available at https://github.com/ChaofanTao/FAT_Quantization.
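The gist of the described pipeline, a transform of the weights in the frequency domain followed by a plain uniform quantizer, can be sketched as follows. This is not the actual FAT implementation (see the linked repository); the fixed magnitude-thresholding mask below merely stands in for FAT's learned transform.

```python
import numpy as np

def frequency_transform(weight, keep_ratio=0.9):
    spec = np.fft.fft2(weight)                  # weights in the frequency domain
    mag = np.abs(spec)
    thresh = np.quantile(mag, 1.0 - keep_ratio)
    spec = np.where(mag >= thresh, spec, 0.0)   # suppress low-magnitude components
    return np.real(np.fft.ifft2(spec))          # back to the spatial domain

def uniform_quantize(weight, num_bits=4):
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(weight).max() / qmax
    return np.clip(np.round(weight / scale), -qmax - 1, qmax) * scale

w = np.random.randn(64, 64).astype(np.float32)
w_q = uniform_quantize(frequency_transform(w), num_bits=4)
```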
“…Post-training quantization aims to quantize neural networks using a small part of the dataset (in some cases, no data at all) to calibrate the quantization parameters so that a certain local criterion is satisfied (e.g., matching the minimum and maximum, or minimizing the MSE). Recent work [22] showed that minimizing the mean squared error (MSE) introduced in the pre-activations can be considered (under certain assumptions) the best possible local criterion and optimized the rounding policy based on it. Works [12] and [28] use the same local criterion but optimize weights and quantization parameters directly and employ per-channel weight quantization, thus considering a simplified task.…”
Section: Quantization Methods (mentioning, confidence 99%)
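As a rough illustration of the MSE-based local criterion quoted above (not the optimization procedures of [22], [12], or [28]), one can sweep candidate clipping thresholds on calibration data and keep the one with the smallest quantization MSE:

```python
import numpy as np

def mse_calibrate(x, num_bits=8, grid=100):
    """Sweep candidate clipping thresholds; return the scale with minimal MSE."""
    qmax = 2 ** (num_bits - 1) - 1
    best_scale, best_err = None, np.inf
    for frac in np.linspace(0.1, 1.0, grid):
        scale = frac * np.abs(x).max() / qmax
        xq = np.clip(np.round(x / scale), -qmax - 1, qmax) * scale
        err = np.mean((x - xq) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

acts = np.random.laplace(size=10000)   # stand-in for calibration pre-activations
scale = mse_calibrate(acts, num_bits=8)
```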
“…At the same time, it does not introduce significant computational overhead. Nevertheless, considerable research effort has been devoted to eliminating the need for per-channel quantization of weights in order to simplify the implementation of quantized operations [21,22]. In our work, we investigate the importance of per-channel quantization for GANs.…”
Section: Per-channel and Per-tensor Weight Quantization (mentioning, confidence 99%)
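The distinction between the two granularities can be illustrated with a short sketch (assumed weight layout `[out_channels, in_features]`; not code from the cited works):

```python
import numpy as np

def fake_quantize(w, scale, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

w = np.random.randn(128, 64)                      # [out_channels, in_features]

# Per-tensor: a single scale shared by the whole weight tensor.
w_per_tensor = fake_quantize(w, np.abs(w).max() / 127)

# Per-channel: one scale per output channel, broadcast along each row.
channel_scales = np.abs(w).max(axis=1, keepdims=True) / 127
w_per_channel = fake_quantize(w, channel_scales)
```

Per-channel scales track each filter's range, so the quantization error is typically lower, at the cost of a slightly more complex integer kernel.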
“…Another approach to post-training quantization of GANs is built upon recent works [17,5,22]. Consider a generator model…”
Generative adversarial networks (GANs) have an enormous potential impact on digital content creation, e.g., photo-realistic digital avatars, semantic content editing, and quality enhancement of speech and images. However, the performance of modern GANs comes together with massive amounts of computation performed during inference and high energy consumption. That complicates, or even makes impossible, their deployment on edge devices. The problem can be reduced with quantization, a neural network compression technique that facilitates hardware-friendly inference by replacing floating-point computations with low-bit integer ones. While quantization is well established for discriminative models, the performance of modern quantization techniques applied to GANs remains unclear. GANs generate content with a more complex structure than discriminative models, and thus quantization of GANs is significantly more challenging. To tackle this problem, we perform an extensive experimental study of state-of-the-art quantization techniques on three diverse GAN architectures, namely StyleGAN, Self-Attention GAN, and CycleGAN. As a result, we discovered practical recipes that allowed us to successfully quantize these models for inference with 4/8-bit weights and 8-bit activations while preserving the quality of the original full-precision models.
“…[5,45] even perform quantization without accessing any real data. [63,64] adopt intermediate feature-map reconstruction to optimize the rounding policy.…”
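A toy illustration of a rounding-policy search driven by feature-map reconstruction is sketched below. The cited works use gradient-based relaxations; this greedy per-weight flip is only meant to show the objective and would be far too slow in practice.

```python
import numpy as np

def optimize_rounding(w, x, scale, passes=2):
    """Choose floor vs. ceil per weight to minimize feature-map MSE.
    w: [out, in] float weights, x: [batch, in] calibration inputs."""
    lo = np.floor(w / scale) * scale
    hi = lo + scale
    wq = np.where(np.abs(w - lo) <= np.abs(w - hi), lo, hi)  # start from nearest
    target = x @ w.T                                          # float feature map

    def mse(m):
        return np.mean((target - x @ m.T) ** 2)

    for _ in range(passes):                                   # greedy coordinate passes
        for idx in np.ndindex(*w.shape):
            flipped = wq.copy()
            flipped[idx] = hi[idx] if wq[idx] == lo[idx] else lo[idx]
            if mse(flipped) < mse(wq):
                wq = flipped
    return wq

w, x = np.random.randn(8, 16), np.random.randn(32, 16)
w_q = optimize_rounding(w, x, scale=np.abs(w).max() / 7)      # 4-bit-like step size
```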
Model quantization has emerged as an indispensable technique to accelerate deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization work is often unreproducible and undeployable. This is because researchers do not choose consistent training pipelines and ignore the requirements of hardware deployment. In this work, we propose Model Quantization Benchmark (MQBench), a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms. We choose multiple different platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate extensive state-of-the-art quantization algorithms under a unified training pipeline. MQBench acts as a bridge connecting algorithms and hardware. We conduct a comprehensive analysis and find considerable intuitive and counter-intuitive insights. By aligning the training settings, we find that existing algorithms have about the same performance on the conventional academic track, while for hardware-deployable quantization there is a huge accuracy gap which remains unsettled. Surprisingly, no existing algorithm wins every challenge in MQBench, and we hope this work can inspire future research directions.