2021
DOI: 10.48550/arxiv.2105.11010
Preprint

Post-Training Sparsity-Aware Quantization

Abstract: Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency. Uniform post-training quantization (PTQ) methods are common, since they can be implemented efficiently in hardware and do not require extensive hardware resources or a training set. Mapping FP32 models to INT8 using uniform PTQ yields models with negligible accuracy degradation; however, reducing precision below 8 bits with PTQ is challenging, as accuracy degradation becomes noticeable, du…
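As a rough illustration of the uniform PTQ described in the abstract, a minimal sketch of mapping an FP32 weight tensor to INT8 with a single per-tensor scale might look like the following. This is a generic symmetric scheme for illustration only; the function names and the per-tensor, symmetric choice are assumptions and do not reproduce the paper's sparsity-aware method.

import numpy as np

def quantize_symmetric_int8(w_fp32):
    # Per-tensor symmetric scale: the largest absolute weight maps to 127.
    scale = float(np.abs(w_fp32).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor; avoid division by zero
    q = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # FP32 approximation of the original weights; the gap is the rounding noise
    # that grows quickly once fewer than 8 bits are available.
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_symmetric_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())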

Cited by 4 publications (3 citation statements)
References 28 publications
“…According to [54], the usage of low-precision fixed integer value representation has the potential to reduce the memory footprint and latency by a factor of 16×. Despite being fast and very easy to use, the post-training quantization approach is not an option because it suffers from significant degradation in model accuracy in case of precision lower than 8 bits [55].…”
Section: Model Quantization (mentioning)
Confidence: 99%
“…Model quantization is an optimization technique that aims at transforming the higher-bit level weights to lower-bit level weights, e.g., from float32 weights to 8-bit integer weights, to reduce the size of the model for an easy model deployment. Multiple quantization approaches [32,33,38,53] have been proposed given its importance in DL-based engineering. An important part of quantization methods is the mapping between the two parts of weights.…”
Section: Model Quantization (mentioning)
Confidence: 99%
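The "mapping between the two parts of weights" mentioned in the statement above is usually an affine (asymmetric) mapping defined by a scale and a zero-point. A hedged sketch of that variant follows; the names and the unsigned 8-bit range are assumptions for illustration, not taken from the cited works.

import numpy as np

def quantize_affine_uint8(x_fp32):
    # Affine mapping: the real value 0.0 lands exactly on an integer of the uint8 grid.
    x_min = min(float(x_fp32.min()), 0.0)
    x_max = max(float(x_fp32.max()), 0.0)
    scale = (x_max - x_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # constant-zero input
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x_fp32 / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    # Map the integer grid back to an FP32 approximation.
    return (q.astype(np.float32) - zero_point) * scale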
“…To reduce the data exchange bandwidth and on-chip storage size requirements for CNN inference accelerators, one straightforward method is to compress the generated interlayer data simultaneously during inference. Many works reported quantizing the activation into low-precision fixed-point data to reduce the interlayer data size [11,25,26,54,55]. However, the lowprecision activation will significantly degrade the prediction accuracy of CNNs if the quantization is straightforwardly implemented on interlayer activations without retraining [27,28].…”
Section: Introduction (mentioning)
Confidence: 99%