Deep convolutional neural networks (CNNs) have achieved remarkable performance at the cost of huge computation. As CNN models become deeper and more complex, compressing CNNs into sparse models by pruning redundant connections has emerged as an attractive approach to reduce both computation and memory requirements. Meanwhile, FPGAs have been demonstrated to be an effective hardware platform for accelerating CNN inference. However, most existing FPGA accelerators target dense CNN models and are inefficient when executing sparse models, since most of their arithmetic operations involve additions and multiplications with zero operands. In this work, we propose an accelerator with software-hardware co-design for sparse CNNs on FPGAs. To efficiently handle the irregular connections in sparse convolutional layers, we propose a weight-oriented dataflow that exploits element-matrix multiplication as the key operation. Each weight is processed individually, which yields low decoding overhead. We then design an FPGA accelerator featuring a tile look-up table (TLUT) and a channel multiplexer (CMUX). The TLUT matches the indices between sparse weights and input pixels, mitigating the runtime decoding overhead with an efficient indexing operation. Moreover, we propose a weight layout that enables conflict-free on-chip memory access; to cooperate with this layout, the CMUX is inserted to locate the correct address. Finally, we build a Neural Architecture Search (NAS) engine that leverages the reconfigurability of FPGAs to generate an efficient CNN model and choose the optimal hardware design parameters. Experiments demonstrate that our accelerator achieves 223.4-309.0 GOP/s for modern CNNs on a Xilinx ZCU102, a 2.4X-12.9X speedup over previous dense CNN accelerators on FPGAs. Our FPGA-aware NAS approach shows a 2X speedup over MobileNetV2 with 1.5% accuracy loss.
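The weight-oriented dataflow described above can be illustrated with a minimal software sketch: each nonzero weight is processed individually as one element-matrix multiplication, i.e., a scalar weight times a tile of the input feature map, accumulated into the corresponding output channel. The function and argument names below are hypothetical, and the sketch ignores padding, strides, tiling, and the TLUT/CMUX hardware mechanisms; it only shows the arithmetic pattern.

```python
import numpy as np

def sparse_conv_weight_oriented(inputs, sparse_weights, out_channels, K):
    """Weight-oriented sparse convolution (stride 1, no padding).

    inputs: array of shape (C_in, H, W)
    sparse_weights: iterable of (oc, ic, kh, kw, value) for nonzero weights,
        where (kh, kw) is the position inside the K x K kernel.
    """
    _, H, W = inputs.shape
    Ho, Wo = H - K + 1, W - K + 1
    out = np.zeros((out_channels, Ho, Wo))
    for oc, ic, kh, kw, v in sparse_weights:
        # Element-matrix multiplication: one scalar weight times the
        # shifted input tile it touches; zero weights are simply skipped.
        out[oc] += v * inputs[ic, kh:kh + Ho, kw:kw + Wo]
    return out
```

Because the loop iterates only over nonzero weights, the work scales with the density of the pruned model rather than the dense kernel size, which is the property the accelerator exploits.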