Proceedings of the 1st on Reproducible Quality-Efficient Systems Tournament on Co-Designing Pareto-Efficient Deep Learning 2018
DOI: 10.1145/3229762.3229763

Highly Efficient 8-bit Low Precision Inference of Convolutional Neural Networks with IntelCaffe

Abstract: High-throughput and low-latency inference of deep neural networks is critical for the deployment of deep learning applications. This paper presents the efficient inference techniques of IntelCaffe, the first Intel®-optimized deep learning framework that supports efficient 8-bit low-precision inference and model optimization techniques for convolutional neural networks on Intel® Xeon® Scalable Processors. The 8-bit optimized model is automatically generated with a calibration process from the FP32 model without …
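
The calibration step mentioned in the abstract can be illustrated with a minimal sketch in plain NumPy. This is not IntelCaffe's actual API; the max-based scaling rule and the function names are illustrative assumptions only.

```python
import numpy as np

def calibrate_activation_scale(fp32_activations, num_bits=8):
    """Derive a symmetric int8 scale from FP32 activations gathered
    over a calibration set (simple max-based calibration)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    max_abs = max(np.abs(batch).max() for batch in fp32_activations)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize_int8(x, scale):
    """Quantize an FP32 tensor to int8 using the calibrated scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# Toy calibration set: a few FP32 activation batches
calib_batches = [np.random.randn(4, 64).astype(np.float32) for _ in range(8)]
scale = calibrate_activation_scale(calib_batches)
q = quantize_int8(calib_batches[0], scale)
print(scale, q.dtype)
```

The sketch only shows how an int8 scale could be derived from FP32 calibration data; the paper's framework additionally applies model optimization techniques beyond this step.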

Cited by 28 publications (13 citation statements); references 5 publications. Selected citation statements, ordered by relevance:
“…Quantization can either be applied to weight values, inter-layer activation values, or both. Values can be quantized from the typically 4-byte floating-point values to 8-bit integers [47], or even more aggressively to ternary [48,49] or binary [50,51] values with varying accuracy loss trade-offs. Quantization is typically more readily available on embedded neural networks compared to pruning since pruning can introduce sparsity in the weights, resulting in complex random access [43,52].…”
Section: Neural Network Compression (mentioning)
confidence: 99%
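
As a rough illustration of the precision/accuracy trade-off described in the statement above, the sketch below quantizes the same weight tensor to int8 and to ternary values, in the spirit of the 8-bit [47] and ternary [48,49] schemes it cites. The 0.7·mean|w| ternary threshold is a common heuristic used here only as an assumption.

```python
import numpy as np

def quantize_int8_sym(w):
    """Symmetric 8-bit quantization: map max |w| to 127 and round."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def quantize_ternary(w):
    """Ternary quantization: weights become {-1, 0, +1} times one scaling
    factor (heuristic threshold of 0.7 * mean |w|, assumed here)."""
    delta = 0.7 * np.abs(w).mean()
    mask = np.abs(w) > delta
    t = np.sign(w) * mask
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return t, alpha

w = np.random.randn(256, 256).astype(np.float32)
q8, s8 = quantize_int8_sym(w)
t, a = quantize_ternary(w)
err8 = np.abs(w - q8.astype(np.float32) * s8).mean()
err_t = np.abs(w - t * a).mean()
print(f"int8 mean abs error {err8:.4f} vs ternary {err_t:.4f}")
```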
“…Similar to [26], [16] also fuses BN with the weights, taking the approach further by fusing BN with the biases, if they are used. A BN-fused bias b_BN is computed as…”
Section: Integer/Fixed-Point Quantization (mentioning)
confidence: 99%
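
The exact expression for the BN-fused bias is truncated in the excerpt above. The standard batch-norm folding identity (assumed here; not necessarily the precise form used in [16]) can be sketched as follows.

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into the preceding layer's weights and bias.
    W: (out_channels, ...) weights, b: (out_channels,) bias."""
    inv_std = gamma / np.sqrt(var + eps)               # per-output-channel factor
    W_bn = W * inv_std.reshape(-1, *([1] * (W.ndim - 1)))
    b_bn = (b - mean) * inv_std + beta                 # BN-fused bias
    return W_bn, b_bn

# Toy check: the folded layer matches "layer followed by BN" for a 1x1 case
W = np.random.randn(8, 3)
b = np.random.randn(8)
gamma, beta = np.random.rand(8) + 0.5, np.random.randn(8)
mean, var = np.random.randn(8), np.random.rand(8) + 0.1
x = np.random.randn(3)
ref = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
W_bn, b_bn = fold_batchnorm(W, b, gamma, beta, mean, var)
print(np.allclose(W_bn @ x + b_bn, ref))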
“…A major difference with [26], however, is that [16] is a PTQ technique. They use calibration data to compute the quantization scaling factors for the weights, activations, and biases.…”
Section: Integer/Fixed-Point Quantization (mentioning)
confidence: 99%
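
A minimal sketch of such post-training calibration of scaling factors is given below. It assumes a common convention in which the bias scale is the product of the weight and activation scales, so that int32 biases add directly to int32 accumulators; this is not necessarily the exact rule used in [16].

```python
import numpy as np

def ptq_scales(weights, calib_activations, num_bits=8):
    """Compute per-tensor quantization scales from calibration data."""
    qmax = 2 ** (num_bits - 1) - 1
    s_w = np.abs(weights).max() / qmax
    s_x = max(np.abs(a).max() for a in calib_activations) / qmax
    s_b = s_w * s_x                       # assumed bias-scale convention
    return s_w, s_x, s_b

W = np.random.randn(64, 128).astype(np.float32)
calib = [np.random.randn(32, 128).astype(np.float32) for _ in range(4)]
s_w, s_x, s_b = ptq_scales(W, calib)

b = np.random.randn(64).astype(np.float32)
q_b = np.round(b / s_b).astype(np.int32)  # biases kept at higher precision (int32)
print(s_w, s_x, s_b, q_b.dtype)
```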
“…Migacz (2017) proposed to use a small calibration set to gather activation statistics and then randomly search for a quantized distribution minimizing the Kullback-Leibler divergence to the continuous one. Gong et al. (2018), on the other hand, simply use the L∞ norm of the tensor as a threshold. Lee et al. (2018) employed channel-wise quantization and constructed a dataset of pre-determined parametric probability densities with their respective quantized versions; a simple classifier was trained to select the best-fitting density.…”
Section: Related Work (mentioning)
confidence: 99%
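
The two calibration strategies contrasted above can be sketched as follows: a deliberately simplified KL-based threshold sweep standing in for Migacz's histogram search, and a plain max for the L∞ rule. The bin counts, candidate grid, and re-binning scheme are illustrative assumptions, not the original algorithms.

```python
import numpy as np

def linf_threshold(acts):
    """Gong et al. (2018)-style clipping: the L-inf norm (max |x|) is the threshold."""
    return np.abs(acts).max()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete histograms."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def kl_threshold(acts, num_bins=2048, num_levels=128, num_candidates=64):
    """Simplified Migacz (2017)-style calibration: sweep candidate clip
    thresholds and keep the one whose coarsely re-quantized histogram is
    closest (in KL divergence) to the original activation histogram."""
    hist, edges = np.histogram(np.abs(acts), bins=num_bins)
    best_t, best_kl = edges[-1], np.inf
    for i in np.linspace(num_levels, num_bins, num_candidates, dtype=int):
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()                      # fold the clipped tail into the last bin
        q = np.empty_like(p)
        for chunk in np.array_split(np.arange(i), num_levels):
            q[chunk] = p[chunk].sum() / len(chunk)   # simulate num_levels quantization bins
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t

acts = np.random.laplace(scale=1.0, size=100_000).astype(np.float32)
print("L-inf threshold:", linf_threshold(acts))
print("KL threshold   :", kl_threshold(acts))
```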