2022
DOI: 10.48550/arxiv.2208.07339
Preprint

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Abstract: Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance. With our method, a 175B-parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around pro…
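The quantization step the abstract describes can be illustrated with a minimal sketch of absmax Int8 quantization using one scaling constant per row. This covers only the basic vectorwise part, not the paper's full LLM.int8() procedure (which additionally isolates outlier feature dimensions into a 16-bit matrix multiplication); the function names and NumPy implementation are illustrative assumptions.

```python
# Minimal sketch of absmax row-wise Int8 quantization (illustrative only;
# not the paper's implementation, and without the outlier decomposition).
import numpy as np

def quantize_rowwise_int8(W: np.ndarray):
    """Quantize a 2-D float matrix to int8 with one absmax scale per row."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0  # per-row constant
    scale[scale == 0] = 1.0                               # guard all-zero rows
    W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_int8, scale

def dequantize_rowwise_int8(W_int8: np.ndarray, scale: np.ndarray):
    """Recover an approximate float matrix from int8 values and per-row scales."""
    return W_int8.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 8)).astype(np.float32)
    W_q, s = quantize_rowwise_int8(W)
    print("max abs error:", np.abs(W - dequantize_rowwise_int8(W_q, s)).max())
```

Storing int8 values plus one float scale per row is what halves the memory of a 16-bit checkpoint, up to the small overhead of the scaling constants.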

Cited by 28 publications (58 citation statements)
References 39 publications
“…FP16 × INT4) on mainstream architectures. Moreover, our current results do not include activation quantization, as they are not a significant bottleneck in our target scenarios; however, this can be supported using complementary techniques [5,34].…”
Section: BLOOM model family (3-bit RTN, 3-bit GPTQ, FP16)
Citation type: mentioning
Confidence: 90%
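The FP16 × INT4 setting referred to in this statement (quantized weights, unquantized activations) can be sketched as follows. The group size, function names, and on-the-fly dequantization in NumPy are illustrative assumptions rather than the cited authors' kernels, and a real INT4 kernel would pack two 4-bit values per byte instead of storing them in int8.

```python
# Hedged sketch of weight-only 4-bit round-to-nearest quantization with
# FP16 activations (the FP16 x INT4 setting referred to above).
import numpy as np

def rtn_quantize_4bit(W: np.ndarray, group_size: int = 128):
    """Round each weight group to the 4-bit grid [-8, 7] with its own absmax scale."""
    out_features, in_features = W.shape
    assert in_features % group_size == 0, "illustrative assumption"
    W_g = W.reshape(out_features, in_features // group_size, group_size)
    scale = np.abs(W_g).max(axis=-1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    W_q = np.clip(np.round(W_g / scale), -8, 7).astype(np.int8)  # int4 payload kept in int8
    return W_q, scale

def fp16_x_int4_matmul(x_fp16: np.ndarray, W_q: np.ndarray, scale: np.ndarray):
    """Dequantize weights on the fly and multiply with FP16 activations."""
    W_deq = (W_q.astype(np.float16) * scale.astype(np.float16)).reshape(W_q.shape[0], -1)
    return x_fp16 @ W_deq.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 256)).astype(np.float32)
    x = rng.normal(size=(2, 256)).astype(np.float16)
    W_q, s = rtn_quantize_4bit(W)
    print(fp16_x_int4_matmul(x, W_q, s).shape)  # (2, 16)
```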
“…To our knowledge, we are the first to show that extremely accurate language models with hundreds of billions of parameters can be quantized to 2.5–4 bits per component on average: prior post-training methods only remain accurate at 8 bits [34,5], while prior training-based techniques have only tackled models that are smaller by one to two orders of magnitude [33]. This high degree of compression may appear unsurprising, as these networks are overparametrized; yet, as we discuss in our detailed analysis of results, compression induces non-trivial tradeoffs between the accuracy of the language modeling (perplexity), bit-width, and the size of the original model.…”
Section: BLOOM model family (3-bit RTN, 3-bit GPTQ, FP16)
Citation type: mentioning
Confidence: 99%
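The compression tradeoff this statement describes reduces to simple arithmetic on bits per parameter; the figures below are back-of-the-envelope values for parameter storage only and ignore embeddings, activations, and quantization metadata such as scaling constants.

```python
# Back-of-the-envelope parameter-storage footprint at several bit-widths
# (illustrative arithmetic only).
def model_size_gb(n_params: float, bits_per_param: float) -> float:
    """Parameter storage in gigabytes for a given average bit-width."""
    return n_params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4, 3):
    print(f"175B params at {bits:>2} bits: {model_size_gb(175e9, bits):7.1f} GB")
# 16 bits -> 350.0 GB, 8 -> 175.0 GB, 4 -> 87.5 GB, 3 -> 65.6 GB
```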