Preprint, 2021
DOI: 10.48550/arxiv.2102.06366

Confounding Tradeoffs for Neural Network Quantization

Abstract: Many neural network quantization techniques have been developed to decrease the computational and memory footprint of deep learning. However, these methods are evaluated subject to confounding tradeoffs that may affect inference acceleration or resource complexity in exchange for higher accuracy. In this work, we articulate a variety of tradeoffs whose impact is often overlooked and empirically analyze their impact on uniform and mixed-precision post-training quantization, finding that these confounding tradeof…

Cited by 5 publications (9 citation statements)
References 24 publications (47 reference statements)

Citation statements:
“…While most research on data-free quantization [2,4,7,15,16,43,30] focuses on weight quantization, we provide empirical evidence that input quantization is responsible for a significant part of the accuracy loss, most notably on low bit representation, as illustrated in Fig. 1. Furthermore, we show that per-channel input range estimation allows tighter modelling of the full-precision distribution as compared to a per-example, dynamic approach. As a result, the proposed SPIQ (standing for Static Per-channel Input Quantization) method outperforms both static and dynamic approaches as well as existing state-of-the-art methods.…”
Section: Introduction
confidence: 87%
“…Rounding and truncating are the most common examples. As discussed in [17], quantization methods are classified as either data-driven [20,23,26,38,10,19] or data-free [2,4,7,15,16,43,30,8]. Data-driven methods have been shown to work remarkably well despite a coarse approximation of the continuous optimisation problem.…”
Section: Quantization
confidence: 99%
“…2) Post-Training Quantization: An alternative to the expensive QAT method is Post-Training Quantization (PTQ) which performs the quantization and the adjustments of the weights, without any fine-tuning [11,24,40,59,60,67,68,87,106,138,144,168,176,269]. As such, the overhead of PTQ is very low and often negligible.…”
Section: G. Fine-tuning Methods
confidence: 99%
“…Post-training quantization (PTQ) enables the user to convert an already trained float model and quantize it without retraining [10,23,7,11]. However, it can also result in drastic reduction in model quality.…”
Section: Related Work
confidence: 99%