2022
DOI: 10.48550/arxiv.2206.01859
Preprint

Extreme Compression for Pre-trained Transformers Made Simple and Efficient

Abstract: Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices. However, to preserve the accuracy for such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage expensive knowledge distillation with extensive hyperparameter tuning. Also, they oftentimes focus less on smaller transformer models that have already been heavily compressed via knowl…
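
For context on the "binary/ternary" quantization the abstract refers to: ultra-low-bit schemes replace full-precision weights with codes drawn from a tiny set (e.g., {-1, 0, +1}) multiplied by a per-tensor or per-row scale. The sketch below is a generic TWN-style ternarization in NumPy; the 0.7 threshold heuristic and all names are illustrative assumptions, not this paper's own algorithm:

import numpy as np

def ternarize(W: np.ndarray, threshold_factor: float = 0.7):
    """Ternary-quantize a weight matrix to alpha * {-1, 0, +1}.

    The threshold_factor follows the TWN-style heuristic (an assumption
    here, not necessarily the scheme used in the paper above)."""
    delta = threshold_factor * np.abs(W).mean()   # sparsity threshold
    mask = np.abs(W) > delta                      # weights that stay non-zero
    ternary = np.sign(W) * mask                   # codes in {-1, 0, +1}
    # Scale minimizing ||W - alpha * ternary||^2 over the kept entries.
    alpha = np.abs(W[mask]).mean() if mask.any() else 0.0
    return ternary.astype(np.int8), alpha

# Example: quantize a random layer and measure reconstruction error.
rng = np.random.default_rng(0)
W = rng.normal(size=(768, 768)).astype(np.float32)
codes, alpha = ternarize(W)
W_hat = alpha * codes.astype(np.float32)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))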

Cited by 2 publications (3 citation statements)
References 35 publications (100 reference statements)
“…In addition, our study focuses on generative tasks, and does not consider activation quantization, nor speedups in batched execution. These are natural directions for future work, and we believe this can be achieved with carefully-designed GPU kernels and extensions of existing complementary techniques [34,33].…”
Section: Discussion (mentioning)
confidence: 99%
“…To our knowledge, we are the first to show that extremely accurate language models with hundreds of billions of parameters can be quantized to 2.5–4 bits per component on average: prior post-training methods only remain accurate at 8 bits [34,5], while prior training-based techniques have only tackled models that are smaller by one to two orders of magnitude [33]. This high degree of compression may appear unsurprising, as these networks are overparametrized; yet, as we discuss in our detailed analysis of results, compression induces non-trivial tradeoffs between the accuracy of the language modeling (perplexity), bit-width, and the size of the original model.…”
Section: BLOOM Model Family, 3-bit RTN, 3-bit GPTQ, FP16 (mentioning)
confidence: 99%
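
As background for the "3-bit RTN" baseline named in the section label above: round-to-nearest (RTN) quantization maps each weight to the nearest point of a uniform b-bit grid, typically with a per-row or per-group scale. A minimal sketch, assuming symmetric per-row scaling; group size, zero-points, and clipping choices vary between papers and are assumptions here:

import numpy as np

def rtn_quantize(W: np.ndarray, bits: int = 3):
    """Symmetric per-row round-to-nearest quantization to `bits` bits.
    Generic illustration, not the cited paper's exact configuration."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 3 for 3-bit symmetric
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def rtn_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)
q, s = rtn_quantize(W, bits=3)
print(rtn_dequantize(q, s) - W)                   # element-wise quantization error
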
“…However, most of those works need quantization-aware finetuning or even expensive quantization-aware knowledge distillation (Hinton, Vinyals, and Dean 2014). Due to the cost of training/finetuning LLMs (Polino, Pascanu, and Alistarh 2018; Jiao et al. 2019; Tao et al. 2022; Wu et al. 2022, 2023), it is a challenge for practitioners/researchers to do finetuning/distillation on those LLMs, particularly for models like GPT-3-175B (Brown et al. 2020) and BLOOM-176B (Scao et al. 2022).…”
Section: Related Work (mentioning)
confidence: 99%
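
To make concrete why the quantization-aware knowledge distillation mentioned above is costly at LLM scale: a typical recipe runs both a frozen full-precision teacher and the quantized student on every batch and matches their softened output distributions (Hinton, Vinyals, and Dean). The function below is a generic sketch of that idea, not the exact objective of the cited works; the temperature value is an illustrative assumption:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label KD loss in the style of Hinton et al.; the temperature
    is an illustrative choice, not taken from the cited papers."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in the original formulation.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Each training step requires a forward pass through BOTH models, which is
# why distillation on 100B+ parameter models is expensive (hypothetical usage):
#   teacher_logits = teacher(batch)            # frozen full-precision model
#   student_logits = quantized_student(batch)  # model with quantized weights
#   loss = distillation_loss(student_logits, teacher_logits)
#   loss.backward(); optimizer.step()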