Simulated Quantization, Real Power Savings
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
DOI: 10.1109/cvprw56347.2022.00311

Cited by 5 publications (2 citation statements)
References 6 publications
“…Table 1 summarizes different LLMs; their release dates, sizes, and numbers of pretrained tokens; and their capabilities. Quantization is a technique to compress a model by converting the weights and activations within an LLM from a high-precision data representation to lower-precision data representation [52]. There are two types of LLM quantization: post-training quantization (PTQ) and quantization-aware training (QAT) [53].…”
Section: Large Language Models
Citation type: mentioning
Confidence: 99%
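To make the PTQ/QAT distinction in the statement above concrete, here is a minimal Python sketch of simulated (fake) quantization. It assumes per-tensor symmetric uniform quantization with round-to-nearest, which are common defaults but are not specified in the excerpt; `fake_quantize` is an illustrative name, not an API from the cited works.

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Simulated (fake) quantization: snap FP32 weights onto an integer
    grid, then immediately map them back to FP32 for computation."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8-bit signed
    scale = np.abs(w).max() / qmax              # per-tensor scale (assumption)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).astype(np.float32)

# Post-training quantization (PTQ): quantize an already-trained tensor once.
w = np.random.randn(4, 4).astype(np.float32)
w_ptq = fake_quantize(w)

# Quantization-aware training (QAT) would instead call fake_quantize inside
# the training forward pass, so the weights learn to tolerate the grid.
print(float(np.abs(w - w_ptq).max()))           # worst-case rounding error
```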
“…For both "signed Qi.f " and "unsigned Qi.f " formats, 1 bit for the integer part is required for representing the maximum possible weight value (i.e., w = 1). Quantization Steps: We quantize only the weights through the simulated quantization approach, which represents the weight values under fixed-point format, and performing computations under floating-point format (Jacob et al, 2018;Krishnamoorthi, 2018;Gholami et al, 2021;van Baalen et al, 2022). To perform quantization, we convert the weight values from 32-bit floating-point format (FP32) to 8-bit fixed-point format (signed Q1.6) by constructing their 8-bit binary representations under 32-bit integer format (INT32), thereby conveniently performing bit-wise modification and rounding operation while considering the sign and the rounding scheme (i.e., truncation).…”
Section: Quantizing the SNN Weights
Citation type: mentioning
Confidence: 99%
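As a companion to the excerpt above, the following is a short Python sketch of simulated weight quantization to the signed Q1.6 fixed-point format (1 sign bit, 1 integer bit, 6 fraction bits, i.e. a step of 2^-6 and a range of [-2, 2 - 2^-6]), with truncation as the rounding scheme. It mimics the numerical effect described in the statement rather than the cited work's exact bit-wise construction under INT32, and implementing truncation as flooring onto the grid is an assumption.

```python
import numpy as np

FRAC_BITS = 6                        # signed Q1.6: 1 sign + 1 integer + 6 fraction bits
STEP = 2.0 ** -FRAC_BITS             # resolution of the grid: 0.015625
QMIN, QMAX = -2.0, 2.0 - STEP        # representable range of signed Q1.6

def simulated_quantize_q1_6(w_fp32):
    """Snap FP32 weights onto the signed Q1.6 fixed-point grid by truncation,
    then return them as FP32 so downstream computation stays in floating
    point (the essence of simulated quantization)."""
    w = np.clip(w_fp32, QMIN, QMAX)              # keep values inside the Q1.6 range
    q = np.floor(w / STEP).astype(np.int32)      # truncation (floor) onto the grid -- assumption
    return (q * STEP).astype(np.float32)

w = np.array([0.7312, -0.05, 1.0, -1.23], dtype=np.float32)
print(simulated_quantize_q1_6(w))    # e.g. 0.7312 -> 0.71875 (= 46 * 2**-6)
```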