2022
DOI: 10.48550/arxiv.2206.01859
Preprint

Extreme Compression for Pre-trained Transformers Made Simple and Efficient

Abstract: Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices. However, to preserve the accuracy for such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage expensive knowledge distillation with extensive hyperparameter tuning. Also, they oftentimes focus less on smaller transformer models that have already been heavily compressed via knowl…
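
For context on the "binary/ternary" quantization the abstract refers to: ultra-low-bit schemes replace full-precision weights with codes drawn from a tiny set (e.g., {-1, 0, +1}) multiplied by a per-tensor or per-row scale. The sketch below is a generic TWN-style ternarization in NumPy; the 0.7 threshold heuristic and all names are illustrative assumptions, not this paper's own algorithm:

import numpy as np

def ternarize(W: np.ndarray, threshold_factor: float = 0.7):
    """Ternary-quantize a weight matrix to alpha * {-1, 0, +1}.

    The threshold_factor follows the TWN-style heuristic (an assumption
    here, not necessarily the scheme used in the paper above)."""
    delta = threshold_factor * np.abs(W).mean()   # sparsity threshold
    mask = np.abs(W) > delta                      # weights that stay non-zero
    ternary = np.sign(W) * mask                   # codes in {-1, 0, +1}
    # Scale minimizing ||W - alpha * ternary||^2 over the kept entries.
    alpha = np.abs(W[mask]).mean() if mask.any() else 0.0
    return ternary.astype(np.int8), alpha

# Example: quantize a random layer and measure reconstruction error.
rng = np.random.default_rng(0)
W = rng.normal(size=(768, 768)).astype(np.float32)
codes, alpha = ternarize(W)
W_hat = alpha * codes.astype(np.float32)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))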

Cited by 2 publications (3 citation statements)
References 35 publications (100 reference statements)
“…In addition, our study focuses on generative tasks, and does not consider activation quantization, nor speedups in batched execution. These are natural directions for future work, and we believe this can be achieved with carefully-designed GPU kernels and extensions of existing complementary techniques [34,33].…”
Section: Discussion (mentioning)
confidence: 99%
“…To our knowledge, we are the first to show that extremely accurate language models with hundreds of billions of parameters can be quantized to 2.5–4 bits per component on average: prior post-training methods only remain accurate at 8 bits [34,5], while prior training-based techniques have only tackled models that are smaller by one to two orders of magnitude [33]. This high degree of compression may appear unsurprising, as these networks are overparametrized; yet, as we discuss in our detailed analysis of results, compression induces non-trivial tradeoffs between the accuracy of the language modeling (perplexity), bit-width, and the size of the original model.…”
Section: BLOOM Model Family, 3-bit RTN, 3-bit GPTQ, FP16 (mentioning)
confidence: 99%
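
As background for the "3-bit RTN" baseline named in the section label above: round-to-nearest (RTN) quantization maps each weight to the nearest point of a uniform b-bit grid, typically with a per-row or per-group scale. A minimal sketch, assuming symmetric per-row scaling; group size, zero-points, and clipping choices vary between papers and are assumptions here:

import numpy as np

def rtn_quantize(W: np.ndarray, bits: int = 3):
    """Symmetric per-row round-to-nearest quantization to `bits` bits.
    Generic illustration, not the cited paper's exact configuration."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 3 for 3-bit symmetric
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def rtn_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)
q, s = rtn_quantize(W, bits=3)
print(rtn_dequantize(q, s) - W)                   # element-wise quantization error
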
“…However, most of those works need quantization-aware finetuning or even expensive quantization-aware knowledge distillation (Hinton, Vinyals, and Dean 2014). Due to the cost of training/finetuning LLMs (Polino, Pascanu, and Alistarh 2018; Jiao et al. 2019; Tao et al. 2022; Wu et al. 2022, 2023), it is a challenge for practitioners/researchers to do finetuning/distillation on those LLMs, particularly for models like GPT-3-175B (Brown et al. 2020) and BLOOM-176B (Scao et al. 2022).…”
Section: Related Work (mentioning)
confidence: 99%
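
To make concrete why the quantization-aware knowledge distillation mentioned above is costly at LLM scale: a typical recipe runs both a frozen full-precision teacher and the quantized student on every batch and matches their softened output distributions (Hinton, Vinyals, and Dean). The function below is a generic sketch of that idea, not the exact objective of the cited works; the temperature value is an illustrative assumption:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label KD loss in the style of Hinton et al.; the temperature
    is an illustrative choice, not taken from the cited papers."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in the original formulation.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Each training step requires a forward pass through BOTH models, which is
# why distillation on 100B+ parameter models is expensive (hypothetical usage):
#   teacher_logits = teacher(batch)            # frozen full-precision model
#   student_logits = quantized_student(batch)  # model with quantized weights
#   loss = distillation_loss(student_logits, teacher_logits)
#   loss.backward(); optimizer.step()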