Proceedings of the 1st on Reproducible Quality-Efficient Systems Tournament on Co-Designing Pareto-Efficient Deep Learning 2018
DOI: 10.1145/3229762.3229763

Highly Efficient 8-bit Low Precision Inference of Convolutional Neural Networks with IntelCaffe

Abstract: High-throughput and low-latency inference of deep neural networks is critical for the deployment of deep learning applications. This paper presents the efficient inference techniques of IntelCaffe, the first Intel®-optimized deep learning framework that supports efficient 8-bit low-precision inference and model optimization techniques for convolutional neural networks on Intel® Xeon® Scalable Processors. The 8-bit optimized model is automatically generated with a calibration process from the FP32 model without …
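
The calibration step mentioned in the abstract can be illustrated with a minimal sketch in plain NumPy. This is not IntelCaffe's actual API; the max-based scaling rule and the function names are illustrative assumptions only.

```python
import numpy as np

def calibrate_activation_scale(fp32_activations, num_bits=8):
    """Derive a symmetric int8 scale from FP32 activations gathered
    over a calibration set (simple max-based calibration)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    max_abs = max(np.abs(batch).max() for batch in fp32_activations)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize_int8(x, scale):
    """Quantize an FP32 tensor to int8 using the calibrated scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# Toy calibration set: a few FP32 activation batches
calib_batches = [np.random.randn(4, 64).astype(np.float32) for _ in range(8)]
scale = calibrate_activation_scale(calib_batches)
q = quantize_int8(calib_batches[0], scale)
print(scale, q.dtype)
```

The sketch only shows how an int8 scale could be derived from FP32 calibration data; the paper's framework additionally applies model optimization techniques beyond this step.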

Cited by 28 publications (13 citation statements); references 5 publications. Selected citation statements, ordered by relevance:
“…Quantization can either be applied to weight values, inter-layer activation values, or both. Values can be quantized from the typically 4-byte floating-point values to 8-bit integers [47], or even more aggressively to ternary [48,49] or binary [50,51] values with varying accuracy loss trade-offs. Quantization is typically more readily available on embedded neural networks compared to pruning since pruning can introduce sparsity in the weights, resulting in complex random access [43,52].…”
Section: Neural Network Compression (mentioning)
confidence: 99%
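
As a rough illustration of the precision/accuracy trade-off described in the statement above, the sketch below quantizes the same weight tensor to int8 and to ternary values, in the spirit of the 8-bit [47] and ternary [48,49] schemes it cites. The 0.7·mean|w| ternary threshold is a common heuristic used here only as an assumption.

```python
import numpy as np

def quantize_int8_sym(w):
    """Symmetric 8-bit quantization: map max |w| to 127 and round."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def quantize_ternary(w):
    """Ternary quantization: weights become {-1, 0, +1} times one scaling
    factor (heuristic threshold of 0.7 * mean |w|, assumed here)."""
    delta = 0.7 * np.abs(w).mean()
    mask = np.abs(w) > delta
    t = np.sign(w) * mask
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return t, alpha

w = np.random.randn(256, 256).astype(np.float32)
q8, s8 = quantize_int8_sym(w)
t, a = quantize_ternary(w)
err8 = np.abs(w - q8.astype(np.float32) * s8).mean()
err_t = np.abs(w - t * a).mean()
print(f"int8 mean abs error {err8:.4f} vs ternary {err_t:.4f}")
```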
“…Similar to [26], [16] also fuses BN with the weights, taking the approach further by fusing BN with the biases, if they are used. A BN-fused bias b_BN is computed as…”
Section: Integer/Fixed-Point Quantization (mentioning)
confidence: 99%
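
The exact expression for the BN-fused bias is truncated in the excerpt above. The standard batch-norm folding identity (assumed here; not necessarily the precise form used in [16]) can be sketched as follows.

```python
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into the preceding layer's weights and bias.
    W: (out_channels, ...) weights, b: (out_channels,) bias."""
    inv_std = gamma / np.sqrt(var + eps)               # per-output-channel factor
    W_bn = W * inv_std.reshape(-1, *([1] * (W.ndim - 1)))
    b_bn = (b - mean) * inv_std + beta                 # BN-fused bias
    return W_bn, b_bn

# Toy check: the folded layer matches "layer followed by BN" for a 1x1 case
W = np.random.randn(8, 3)
b = np.random.randn(8)
gamma, beta = np.random.rand(8) + 0.5, np.random.randn(8)
mean, var = np.random.randn(8), np.random.rand(8) + 0.1
x = np.random.randn(3)
ref = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
W_bn, b_bn = fold_batchnorm(W, b, gamma, beta, mean, var)
print(np.allclose(W_bn @ x + b_bn, ref))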
“…A major difference with [26], however, is that [16] is a PTQ technique. They use calibration data to compute the quantization scaling factors for the weights, activations, and biases.…”
Section: Integer/Fixed-Point Quantization (mentioning)
confidence: 99%
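
A minimal sketch of such post-training calibration of scaling factors is given below. It assumes a common convention in which the bias scale is the product of the weight and activation scales, so that int32 biases add directly to int32 accumulators; this is not necessarily the exact rule used in [16].

```python
import numpy as np

def ptq_scales(weights, calib_activations, num_bits=8):
    """Compute per-tensor quantization scales from calibration data."""
    qmax = 2 ** (num_bits - 1) - 1
    s_w = np.abs(weights).max() / qmax
    s_x = max(np.abs(a).max() for a in calib_activations) / qmax
    s_b = s_w * s_x                       # assumed bias-scale convention
    return s_w, s_x, s_b

W = np.random.randn(64, 128).astype(np.float32)
calib = [np.random.randn(32, 128).astype(np.float32) for _ in range(4)]
s_w, s_x, s_b = ptq_scales(W, calib)

b = np.random.randn(64).astype(np.float32)
q_b = np.round(b / s_b).astype(np.int32)  # biases kept at higher precision (int32)
print(s_w, s_x, s_b, q_b.dtype)
```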
“…Migacz (2017) proposed to use a small calibration set to gather activation statistics and then randomly search for a quantized distribution minimizing the Kullback-Leibler divergence to the continuous one. Gong et al. (2018), on the other hand, simply use the L∞ norm of the tensor as a threshold. Lee et al. (2018) employed channel-wise quantization and constructed a dataset of pre-determined parametric probability densities with their respective quantized versions; a simple classifier was trained to select the best-fitting density.…”
Section: Related Work (mentioning)
confidence: 99%
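
The two calibration strategies contrasted above can be sketched as follows: a deliberately simplified KL-based threshold sweep standing in for Migacz's histogram search, and a plain max for the L∞ rule. The bin counts, candidate grid, and re-binning scheme are illustrative assumptions, not the original algorithms.

```python
import numpy as np

def linf_threshold(acts):
    """Gong et al. (2018)-style clipping: the L-inf norm (max |x|) is the threshold."""
    return np.abs(acts).max()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete histograms."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def kl_threshold(acts, num_bins=2048, num_levels=128, num_candidates=64):
    """Simplified Migacz (2017)-style calibration: sweep candidate clip
    thresholds and keep the one whose coarsely re-quantized histogram is
    closest (in KL divergence) to the original activation histogram."""
    hist, edges = np.histogram(np.abs(acts), bins=num_bins)
    best_t, best_kl = edges[-1], np.inf
    for i in np.linspace(num_levels, num_bins, num_candidates, dtype=int):
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()                      # fold the clipped tail into the last bin
        q = np.empty_like(p)
        for chunk in np.array_split(np.arange(i), num_levels):
            q[chunk] = p[chunk].sum() / len(chunk)   # simulate num_levels quantization bins
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t

acts = np.random.laplace(scale=1.0, size=100_000).astype(np.float32)
print("L-inf threshold:", linf_threshold(acts))
print("KL threshold   :", kl_threshold(acts))
```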