Early-Stage Neural Network Hardware Performance Analysis

Karbachevsky, Alex; Baskin, Chaim; Zheltonozhskii, Evgenii; Yermolin, Yevgeny; Gabbay, Freddy; Bronstein, Alex; Mendelson, Avi

doi:10.3390/su13020717

Cited by 15 publications

(5 citation statements)

References 50 publications


“…While the quantization of CNN parameters leads to a reduction of power and area, it can also generate unexpected changes in the balance between communication and computation. Karbachevsky et al [33] studied the impact of CNN quantization on hardware implementation of computational resources. It combines the research conducted in Baskin et al [34] to propose a computation and communication analysis for quantized CNN.…”

Section: Related Work (mentioning)

confidence: 99%

NICE: Noise Injection and Clamping Estimation for Neural Network Quantization

et al. 2021

Self Cite


Convolutional Neural Networks (CNNs) are very popular in many fields including computer vision, speech recognition, natural language processing, etc. Though deep learning leads to groundbreaking performance in those domains, the networks used are very computationally demanding and are far from being able to perform in real-time applications even on a GPU, which is not power efficient and therefore does not suit low power systems such as mobile devices. To overcome this challenge, some solutions have been proposed for quantizing the weights and activations of these networks, which accelerate the runtime significantly. Yet, this acceleration comes at the cost of a larger error unless spatial adjustments are carried out. The method proposed in this work trains quantized neural networks by noise injection and a learned clamping, which improve accuracy. This leads to state-of-the-art results on various regression and classification tasks, e.g., ImageNet classification with architectures such as ResNet-18/34/50 with as low as 3 bit weights and activations. We implement the proposed solution on an FPGA to demonstrate its applicability for low-power real-time applications. The quantization code will become publicly available upon acceptance.
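The two ingredients named in the abstract, clamping and noise injection, can be illustrated with a minimal sketch. This is a generic uniform quantizer over a clamped range, not the authors' implementation: the function names, the fixed `clamp` value (learned in the paper, a plain argument here), and the uniform noise model are all assumptions made for illustration.

```python
import random

def clamp_quantize(x, clamp, bits):
    """Clamp x to [0, clamp], then round to the nearest of 2**bits uniform levels."""
    step = clamp / (2 ** bits - 1)          # spacing between quantization levels
    clipped = min(max(x, 0.0), clamp)       # clamping bounds the dynamic range
    return round(clipped / step) * step

def inject_noise(x, clamp, bits, rng):
    """Training-time stand-in for rounding: add uniform noise of one-step width."""
    step = clamp / (2 ** bits - 1)
    return x + rng.uniform(-step / 2, step / 2)

rng = random.Random(0)
print(clamp_quantize(1.4, clamp=3.0, bits=2))
print(inject_noise(1.4, clamp=3.0, bits=2, rng=rng))
```

The noise has the same magnitude as the rounding error of the quantizer, which is the usual motivation for using it as a differentiable proxy during training.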



“…The topologic and hardware designs are based on multiple neuron processing and scalable computation. The neural network architecture can be implemented using a processing engine layout [34] for the hardware performance analysis framework for recognizing bottlenecks in the initial stages of a convolutional neural network (CNN). This methodology is useful for evaluating various architectures for embedded chips and associated applications like hardware accelerators.…”

Section: Related Work (mentioning)

confidence: 99%

Performance analysis of multiple input single layer neural network hardware chip

Goel, Kumar. 2023. Multimed Tools Appl


“…Bit operations (BOPs) (Baskin et al, 2021) is another metric that aims to generalize floating-point operations (FLOPs) to heterogeneously quantized NNs. A hardware-aware complexity metric (HCM) (Karbachevsky et al, 2021) has also been proposed that aims to predict the impact of NN architectural decisions on the final hardware resources. Our work makes use of some of these metrics and further explores the connection and tradeoff between pruning and quantization.…”
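To make the BOPs metric in the statement above concrete, here is a small sketch for a single convolutional layer. The formula used, m·n·k²·(b_a·b_w + b_a + b_w + log2(n·k²)) per output position, is the form commonly attributed to Baskin et al.; it does not appear in this excerpt, so treat it as an assumption. The function name `conv_bops` and its parameters are hypothetical.

```python
import math

def conv_bops(n, m, k, b_w, b_a):
    """Approximate bit operations per output position of a k x k convolution.

    n, m: input and output channel counts; b_w, b_a: weight and activation bits.
    Assumed formula: m * n * k^2 * (b_a*b_w + b_a + b_w + log2(n * k^2)).
    """
    macs = m * n * k * k                              # multiply-accumulates
    accumulator_bits = b_a + b_w + math.log2(n * k * k)
    return macs * (b_a * b_w + accumulator_bits)

# Compare the same layer at 8-bit and 4-bit precision.
full = conv_bops(n=64, m=64, k=3, b_w=8, b_a=8)
low = conv_bops(n=64, m=64, k=3, b_w=4, b_a=4)
print(f"8-bit: {full:.0f} BOPs, 4-bit: {low:.0f} BOPs, ratio {full / low:.2f}")
```

Unlike a FLOP count, which is identical for both configurations, this estimate changes with the bit widths, which is what lets it compare heterogeneously quantized layers.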

Section: Efficiency Metrics (mentioning)

confidence: 99%

Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference

Hawks, Duarte, Fraser, et al. 2021. Front. Artif. Intell.


Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra low latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similar to or better in terms of computational efficiency compared to other neural architecture search techniques like Bayesian optimization. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability.

