2021
DOI: 10.3389/frai.2021.676564
Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference

Abstract: Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for u…

Cited by 29 publications (30 citation statements) · References 28 publications
“…On the other hand, math-intensive tensor operations executed on INT8 types can see up to a 16× speed-up compared to the same operations in FP32. Moreover, memory-limited operations could see up to a 4× speed-up compared to the FP32 version [22-24, 41]. Therefore, in addition to pruning, we will reduce the precision of the weights and activations to further decrease the computational complexity of the equalizer, employing the technique known as integer quantization [41].…”
Section: Quantization Technique
confidence: 99%
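The excerpt above refers to mapping FP32 weights and activations onto INT8 values. A minimal sketch of symmetric per-tensor INT8 quantization is shown below; the function and variable names are illustrative only and do not come from the cited works.

```python
# Symmetric per-tensor INT8 quantization: a single scale maps float32
# values onto the signed 8-bit range [-128, 127].
import numpy as np

def quantize_int8(x):
    """Quantize a float32 array to INT8 with one symmetric scale."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 array from INT8 values."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and check the reconstruction error.
w = np.random.randn(64, 64).astype(np.float32)
w_q, s = quantize_int8(w)
w_hat = dequantize(w_q, s)
print("mean abs error:", np.mean(np.abs(w - w_hat)))
```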
“…The quantization process can occur after training or during it. The first case is known as post-training quantization (PTQ), and the second as quantization-aware training (QAT) [22-24]. In PTQ, a trained model has its weights and activations quantized.…”
Section: Quantization Technique
confidence: 99%
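To make the PTQ/QAT distinction concrete, the sketch below contrasts quantizing a trained weight matrix once (PTQ) with the fake-quantization step applied on every forward pass during QAT. All names are illustrative assumptions; the NumPy code only shows the forward pass, not the training loop.

```python
# PTQ vs. QAT fake quantization (forward pass only).
import numpy as np

def fake_quant(x, n_bits=8):
    """Quantize-then-dequantize: the result stays float32 but takes only
    2**n_bits distinct levels, mimicking low-precision arithmetic."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

w = np.random.randn(32, 32).astype(np.float32)

# PTQ: the model is trained in float32; weights are quantized once after
# training, with no opportunity to adapt to the rounding error.
w_ptq = fake_quant(w)

# QAT: fake_quant is applied in every forward pass during training, so the
# loss "sees" the quantization error. In a real framework the gradient is
# passed through the rounding as if it were the identity (straight-through
# estimator); this sketch only illustrates the forward computation.
def qat_forward(w_float, x):
    return x @ fake_quant(w_float)
```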
“…4. Quantization-aware training [55, 59-68] using QKeras [56, 69] or Brevitas [50, 70], parameter pruning [71-76], and general hardware-algorithm codesign can significantly reduce the necessary FPGA resources by reducing the required bit precision and removing irrelevant operations. 5.…”
Section: Inference Timing
confidence: 99%
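The excerpt names QKeras for quantization-aware training and magnitude pruning as complementary ways to shrink FPGA resource usage. Below is a minimal sketch of that combination using QKeras layers wrapped with the TensorFlow Model Optimization pruning API; the layer sizes, 6-bit precision, and 75% sparsity schedule are illustrative assumptions, not values from the cited papers, and the choice of pruning wrapper is likewise an assumption rather than the cited authors' exact setup.

```python
# Quantization-aware training (QKeras) combined with magnitude pruning
# (tensorflow_model_optimization): a sketch under assumed hyperparameters.
import tensorflow as tf
import tensorflow_model_optimization as tfmot
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

def build_model(n_inputs=16, n_outputs=5):
    # Small fully connected network with 6-bit quantized weights/activations.
    inputs = tf.keras.Input(shape=(n_inputs,))
    x = QDense(64,
               kernel_quantizer=quantized_bits(6, 0, alpha=1),
               bias_quantizer=quantized_bits(6, 0))(inputs)
    x = QActivation(quantized_relu(6))(x)
    outputs = QDense(n_outputs,
                     kernel_quantizer=quantized_bits(6, 0, alpha=1),
                     bias_quantizer=quantized_bits(6, 0))(x)
    return tf.keras.Model(inputs, outputs)

# Wrap the quantized model with magnitude pruning that ramps to 75% sparsity.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.75,
    begin_step=0, end_step=10_000)
model = tfmot.sparsity.keras.prune_low_magnitude(
    build_model(), pruning_schedule=pruning_schedule)

model.compile(optimizer="adam", loss="categorical_crossentropy")
# Training requires the UpdatePruningStep callback to advance the schedule:
# model.fit(x_train, y_train,
#           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```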