2022
DOI: 10.48550/arxiv.2208.07339
Preprint

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Abstract: Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance. With our method, a 175B-parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around pro…
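The quantization step the abstract describes can be illustrated with a minimal sketch of absmax Int8 quantization using one scaling constant per row. This covers only the basic vectorwise part, not the paper's full LLM.int8() procedure (which additionally isolates outlier feature dimensions into a 16-bit matrix multiplication); the function names and NumPy implementation are illustrative assumptions.

```python
# Minimal sketch of absmax row-wise Int8 quantization (illustrative only;
# not the paper's implementation, and without the outlier decomposition).
import numpy as np

def quantize_rowwise_int8(W: np.ndarray):
    """Quantize a 2-D float matrix to int8 with one absmax scale per row."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0  # per-row constant
    scale[scale == 0] = 1.0                               # guard all-zero rows
    W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_int8, scale

def dequantize_rowwise_int8(W_int8: np.ndarray, scale: np.ndarray):
    """Recover an approximate float matrix from int8 values and per-row scales."""
    return W_int8.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 8)).astype(np.float32)
    W_q, s = quantize_rowwise_int8(W)
    print("max abs error:", np.abs(W - dequantize_rowwise_int8(W_q, s)).max())
```

Storing int8 values plus one float scale per row is what halves the memory of a 16-bit checkpoint, up to the small overhead of the scaling constants.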

Cited by 28 publications (58 citation statements)
References 39 publications
“…FP16 × INT4) on mainstream architectures. Moreover, our current results do not include activation quantization, as they are not a significant bottleneck in our target scenarios; however, this can be supported using complementary techniques [5,34].…”
Section: BLOOM model family (3-bit RTN, 3-bit GPTQ, FP16)
Citation type: mentioning
Confidence: 90%
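The FP16 × INT4 setting referred to in this statement (quantized weights, unquantized activations) can be sketched as follows. The group size, function names, and on-the-fly dequantization in NumPy are illustrative assumptions rather than the cited authors' kernels, and a real INT4 kernel would pack two 4-bit values per byte instead of storing them in int8.

```python
# Hedged sketch of weight-only 4-bit round-to-nearest quantization with
# FP16 activations (the FP16 x INT4 setting referred to above).
import numpy as np

def rtn_quantize_4bit(W: np.ndarray, group_size: int = 128):
    """Round each weight group to the 4-bit grid [-8, 7] with its own absmax scale."""
    out_features, in_features = W.shape
    assert in_features % group_size == 0, "illustrative assumption"
    W_g = W.reshape(out_features, in_features // group_size, group_size)
    scale = np.abs(W_g).max(axis=-1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    W_q = np.clip(np.round(W_g / scale), -8, 7).astype(np.int8)  # int4 payload kept in int8
    return W_q, scale

def fp16_x_int4_matmul(x_fp16: np.ndarray, W_q: np.ndarray, scale: np.ndarray):
    """Dequantize weights on the fly and multiply with FP16 activations."""
    W_deq = (W_q.astype(np.float16) * scale.astype(np.float16)).reshape(W_q.shape[0], -1)
    return x_fp16 @ W_deq.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 256)).astype(np.float32)
    x = rng.normal(size=(2, 256)).astype(np.float16)
    W_q, s = rtn_quantize_4bit(W)
    print(fp16_x_int4_matmul(x, W_q, s).shape)  # (2, 16)
```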
“…To our knowledge, we are the first to show that extremely accurate language models with hundreds of billions of parameters can be quantized to 2.5–4 bits per component on average: prior post-training methods only remain accurate at 8 bits [34,5], while prior training-based techniques have only tackled models that are smaller by one to two orders of magnitude [33]. This high degree of compression may appear unsurprising, as these networks are overparametrized; yet, as we discuss in our detailed analysis of results, compression induces non-trivial tradeoffs between the accuracy of the language modeling (perplexity), bit-width, and the size of the original model.…”
Section: BLOOM model family (3-bit RTN, 3-bit GPTQ, FP16)
Citation type: mentioning
Confidence: 99%
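The compression tradeoff this statement describes reduces to simple arithmetic on bits per parameter; the figures below are back-of-the-envelope values for parameter storage only and ignore embeddings, activations, and quantization metadata such as scaling constants.

```python
# Back-of-the-envelope parameter-storage footprint at several bit-widths
# (illustrative arithmetic only).
def model_size_gb(n_params: float, bits_per_param: float) -> float:
    """Parameter storage in gigabytes for a given average bit-width."""
    return n_params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4, 3):
    print(f"175B params at {bits:>2} bits: {model_size_gb(175e9, bits):7.1f} GB")
# 16 bits -> 350.0 GB, 8 -> 175.0 GB, 4 -> 87.5 GB, 3 -> 65.6 GB
```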