Simulated Quantization, Real Power Savings
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
DOI: 10.1109/cvprw56347.2022.00311

Cited by 5 publications (2 citation statements)
References 6 publications
“…Table 1 summarizes different LLMs; their release dates, sizes, and numbers of pretrained tokens; and their capabilities. Quantization is a technique to compress a model by converting the weights and activations within an LLM from a high-precision data representation to lower-precision data representation [52]. There are two types of LLM quantization: post-training quantization (PTQ) and quantization-aware training (QAT) [53].…”
Section: Large Language Models
Citation type: mentioning
Confidence: 99%
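To make the PTQ/QAT distinction in the statement above concrete, here is a minimal Python sketch of simulated (fake) quantization. It assumes per-tensor symmetric uniform quantization with round-to-nearest, which are common defaults but are not specified in the excerpt; `fake_quantize` is an illustrative name, not an API from the cited works.

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Simulated (fake) quantization: snap FP32 weights onto an integer
    grid, then immediately map them back to FP32 for computation."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8-bit signed
    scale = np.abs(w).max() / qmax              # per-tensor scale (assumption)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).astype(np.float32)

# Post-training quantization (PTQ): quantize an already-trained tensor once.
w = np.random.randn(4, 4).astype(np.float32)
w_ptq = fake_quantize(w)

# Quantization-aware training (QAT) would instead call fake_quantize inside
# the training forward pass, so the weights learn to tolerate the grid.
print(float(np.abs(w - w_ptq).max()))           # worst-case rounding error
```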
“…For both "signed Qi.f " and "unsigned Qi.f " formats, 1 bit for the integer part is required for representing the maximum possible weight value (i.e., w = 1). Quantization Steps: We quantize only the weights through the simulated quantization approach, which represents the weight values under fixed-point format, and performing computations under floating-point format (Jacob et al, 2018;Krishnamoorthi, 2018;Gholami et al, 2021;van Baalen et al, 2022). To perform quantization, we convert the weight values from 32-bit floating-point format (FP32) to 8-bit fixed-point format (signed Q1.6) by constructing their 8-bit binary representations under 32-bit integer format (INT32), thereby conveniently performing bit-wise modification and rounding operation while considering the sign and the rounding scheme (i.e., truncation).…”
Section: Quantizing the SNN Weights
Citation type: mentioning
Confidence: 99%
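As a companion to the excerpt above, the following is a short Python sketch of simulated weight quantization to the signed Q1.6 fixed-point format (1 sign bit, 1 integer bit, 6 fraction bits, i.e. a step of 2^-6 and a range of [-2, 2 - 2^-6]), with truncation as the rounding scheme. It mimics the numerical effect described in the statement rather than the cited work's exact bit-wise construction under INT32, and implementing truncation as flooring onto the grid is an assumption.

```python
import numpy as np

FRAC_BITS = 6                        # signed Q1.6: 1 sign + 1 integer + 6 fraction bits
STEP = 2.0 ** -FRAC_BITS             # resolution of the grid: 0.015625
QMIN, QMAX = -2.0, 2.0 - STEP        # representable range of signed Q1.6

def simulated_quantize_q1_6(w_fp32):
    """Snap FP32 weights onto the signed Q1.6 fixed-point grid by truncation,
    then return them as FP32 so downstream computation stays in floating
    point (the essence of simulated quantization)."""
    w = np.clip(w_fp32, QMIN, QMAX)              # keep values inside the Q1.6 range
    q = np.floor(w / STEP).astype(np.int32)      # truncation (floor) onto the grid -- assumption
    return (q * STEP).astype(np.float32)

w = np.array([0.7312, -0.05, 1.0, -1.23], dtype=np.float32)
print(simulated_quantize_q1_6(w))    # e.g. 0.7312 -> 0.71875 (= 46 * 2**-6)
```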