2022
DOI: 10.48550/arxiv.2203.14642
Preprint

SPIQ: Data-Free Per-Channel Static Input Quantization

Abstract: Computationally expensive neural networks are ubiquitous in computer vision, and solutions for efficient inference have drawn growing attention in the machine learning community. Examples of such solutions include quantization, i.e. converting the processing values (weights and inputs) from floating point into integers, e.g. int8 or int4. Concurrently, the rise of privacy concerns has motivated the study of less invasive acceleration methods, such as data-free quantization of pre-trained model weights and activa…
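As a rough illustration of the per-channel quantization the abstract refers to, the Python/NumPy sketch below quantizes a tensor to int8 with one scale per channel. It is only a minimal sketch under assumed conventions (symmetric quantization, max-based scales computed from the tensor itself); SPIQ derives static input scales without data, which this toy example does not reproduce.

import numpy as np

def per_channel_int8_quantize(x, axis=0):
    # Symmetric int8 quantization with one scale per channel along `axis`.
    reduce_axes = tuple(i for i in range(x.ndim) if i != axis)
    max_abs = np.max(np.abs(x), axis=reduce_axes, keepdims=True)
    scale = np.maximum(max_abs, 1e-12) / 127.0   # guard against all-zero channels
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Toy input with very different per-channel ranges: per-channel scales keep the
# small-range channels from being crushed by a single per-tensor scale.
x = np.random.randn(4, 16).astype(np.float32) * np.array([[0.01], [0.1], [1.0], [10.0]], dtype=np.float32)
q, s = per_channel_int8_quantize(x, axis=0)
print("max abs reconstruction error:", float(np.abs(dequantize(q, s) - x).max()))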

Cited by 2 publications (4 citation statements)
References 30 publications (59 reference statements)
“…In Table 3, we report our extensive study of post-training W4/A4 quantization techniques on convolutional neural networks (ResNets, MobileNets and EfficientNets) as well as transformers from ViT b16 (86M parameters) to ViT h14 (600M parameters). In this extreme compression regime, we observe the limits of previous state-of-the-art methods SQuant [8] and SPIQ [45]. This is not the case for PowerQuant, which already achieves strong results on ResNets and transformers and, as such, offers a very strong baseline for the proposed NUPES method.…”
Section: Main Result: Comparison To Other GPTQ Methods
confidence: 74%
“…In Table 5, we report our results for several large language models on common sense reasoning tasks. We do not use group-wise quantization, as it leads to incompatibility with activation quantization due to the constraint of dimensionality, as explained in SPIQ [45]. In other words, while we can demonstrate that group-wise quantization can lead to a higher compression rate for the weights, such methods are bound to never quantize the activations.…”
Section: Quantization At All Sizes: Handling Outliers
confidence: 99%
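The dimensionality constraint mentioned in the quote above can be illustrated with a small integer matmul. The sketch below is my own illustration, not code from SPIQ or NUPES: with one scale per output channel of the weights and one static scale for the input, the accumulation stays in integers and the scales factor out after the sum; with group-wise scales along the input dimension, each group would need its own partial accumulator and its own rescaling factor, which is why such schemes cannot also use a single static activation scale.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8)).astype(np.float32)   # input row
W = rng.standard_normal((8, 4)).astype(np.float32)   # weights: 8 inputs, 4 output channels

# Per-output-channel weight scales and a single static input scale.
s_w = np.abs(W).max(axis=0) / 127.0                  # shape (4,): one scale per output channel
s_x = np.abs(x).max() / 127.0                        # scalar static input scale
q_w = np.clip(np.round(W / s_w), -127, 127).astype(np.int32)
q_x = np.clip(np.round(x / s_x), -127, 127).astype(np.int32)

# Pure integer accumulation; the scales factor out once per output channel.
y_int = q_x @ q_w
y_hat = y_int.astype(np.float32) * (s_x * s_w)       # single rescale per column
print("error vs float matmul:", float(np.abs(y_hat - x @ W).max()))

# With group-wise scales along the input dimension (e.g. two groups of 4 rows of W),
# q_x @ q_w is no longer a single rescalable integer sum: each group needs its own
# accumulator and its own (input scale x weight scale) factor before summation, so the
# activations cannot share one static scale per channel.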