2018 21st Euromicro Conference on Digital System Design (DSD)
DOI: 10.1109/dsd.2018.00070

CoNNA – Compressed CNN Hardware Accelerator

Cited by 17 publications (39 citation statements) | References 20 publications

“…Another important aspect to consider is the number of parameters of many state-of-the-art CNNs, which can be on the order of millions (see Table VII as an example). CNNs can also require billions of computations to classify a single input instance [81,82]. Moreover, CNNs produce several intermediate feature maps, which must be stored in memory.…”
Section: F. Discussion About Computational Complexity
confidence: 99%
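
To make the scale in the quote above concrete, here is a back-of-the-envelope calculation for a single convolutional layer. The layer shape is a hypothetical VGG-16-style 3×3 convolution, chosen only for illustration; it is not taken from the cited paper.

```python
# Rough parameter and multiply-accumulate (MAC) counts for one conv layer.
# Shapes are illustrative (a VGG-16-style 3x3 conv), not from the cited paper.

def conv_layer_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a k x k convolution producing h_out x w_out maps."""
    params = c_out * (c_in * k * k + 1)          # weights + one bias per filter
    macs = c_out * c_in * k * k * h_out * w_out  # one MAC per weight per output pixel
    return params, macs

params, macs = conv_layer_cost(c_in=256, c_out=256, k=3, h_out=56, w_out=56)
fmap_vals = 256 * 56 * 56                        # intermediate feature-map values to buffer
print(f"params:      {params:,}")     # ~0.59 million for this single layer
print(f"MACs:        {macs:,}")       # ~1.85 billion for this single layer
print(f"feature map: {fmap_vals:,}")  # ~0.8 million activations to store
```

A full network stacks many such layers, which is how whole-model totals reach millions of parameters and billions of operations per input.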
“…However, even if compression is applied, the original CNN must first be fully trained, which still demands memory to store intermediate feature maps and computational power to perform all operations. Moreover, pruning and weight quantization may affect the overall accuracy of the CNN [81].…”
Section: F. Discussion About Computational Complexity
confidence: 99%
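
A minimal sketch of the two compression steps this quote refers to, magnitude pruning and uniform quantization. The 50% sparsity target and 5-bit width are illustrative assumptions; both steps are lossy, which is exactly the accuracy risk the quote mentions, and in practice fine-tuning is used to recover accuracy.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (illustrative)."""
    k = int(sparsity * w.size)
    thresh = np.partition(np.abs(w).ravel(), k)[k]
    return np.where(np.abs(w) < thresh, 0.0, w)

def uniform_quantize(w, bits=5):
    """Symmetric uniform quantization to 2**bits levels (illustrative)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.astype(np.int8), scale  # store small integers + one FP scale per tensor

w = np.random.randn(256, 256).astype(np.float32)
q, scale = uniform_quantize(magnitude_prune(w), bits=5)
err = np.abs(w - q * scale).mean()  # reconstruction error: the accuracy cost
print(f"mean abs error after prune+quantize: {err:.4f}")
```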
“…There is also a CNN accelerator [11] with 8-bit precision in Table 4, whereas our main target for weight compression is 5-bit quantized weights. Although we have presented our technique for 5-bit weights, our arithmetic coding-based encoding and decoding technique can also be used with 8-bit precision CNN accelerators.…”
Section: B. Latency Overhead
confidence: 99%
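
For intuition about the coding principle behind this quote, below is a self-contained arithmetic coder over a stream of quantized weight symbols, written with exact rational arithmetic for clarity rather than speed. The cited paper's hardware decoder is far more elaborate; this is only a sketch of the idea, and the toy weight stream is invented.

```python
from collections import Counter
from fractions import Fraction

def build_model(symbols):
    """Map each symbol to its cumulative-probability interval [lo, hi)."""
    freq, total = Counter(symbols), len(symbols)
    model, cum = {}, Fraction(0)
    for s in sorted(freq):
        p = Fraction(freq[s], total)
        model[s] = (cum, cum + p)
        cum += p
    return model

def encode(symbols, model):
    """Narrow [0, 1) by each symbol's interval, then emit bits of a point inside."""
    low, high = Fraction(0), Fraction(1)
    for s in symbols:
        span = high - low
        lo, hi = model[s]
        low, high = low + span * lo, low + span * hi
    # Shortest bit string whose dyadic interval fits inside [low, high).
    bits, lo_v, hi_v = [], Fraction(0), Fraction(1)
    while not (low <= lo_v and hi_v <= high):
        mid = (lo_v + hi_v) / 2
        if mid <= low:                 # target wholly in the upper half
            bits.append(1); lo_v = mid
        elif mid >= high:              # target wholly in the lower half
            bits.append(0); hi_v = mid
        elif low <= lo_v:              # already above low: shrink from the top
            bits.append(0); hi_v = mid
        else:                          # lock the lower bound above low
            bits.append(1); lo_v = mid
    return bits

def decode(bits, model, n):
    """Replay the interval narrowing to recover n symbols."""
    value = sum(Fraction(b, 2 ** i) for i, b in enumerate(bits, 1))
    low, high, out = Fraction(0), Fraction(1), []
    for _ in range(n):
        t = (value - low) / (high - low)
        for s, (lo, hi) in model.items():
            if lo <= t < hi:
                span = high - low
                low, high = low + span * lo, low + span * hi
                out.append(s)
                break
    return out

weights = [0, 0, 3, 0, 31, 0, 0, 3, 7, 0, 0, 3]   # toy 5-bit weight stream
model = build_model(weights)
bits = encode(weights, model)
assert decode(bits, model, len(weights)) == weights
print(f"{len(weights) * 5} bits raw -> {len(bits)} bits coded")
```

A production coder would use fixed-precision integer renormalization instead of exact fractions; the Fraction version simply keeps the interval-narrowing logic visible.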
“…• We introduce a lossless arithmetic coding-based 5-bit quantized weight compression technique;
• We propose a hardware-based decoder for in-situ decompression of the compressed weights in the NPU or CNN accelerator, and implement the decoder in a field-programmable gate array (FPGA) as a proof of concept;
• Our proposed technique for 5-bit quantized weights reduces the weight size by 9.6× (by up to 112.2× in the case of pruned weights) compared to using uncompressed 32-bit floating-point (FP32) weights;
• Our proposed technique for 5-bit quantized weights also reduces memory energy consumption by 89.2% (by up to 99.1% for pruned weights) compared to using uncompressed FP32 weights;
• When combined with various state-of-the-art CNN accelerators [9] [10] [11], our compression technique and hardware decoder (16 decoding units) incur a small latency overhead of 0.16%-5.48% (0.16%-0.91% for pruned weights) compared to the case without them;
• When combined with various state-of-the-art CNN accelerators [9] [10], our proposed technique with a 4-decoding-unit (DU) decoder reduces system-level energy consumption by 1.1%-9.3% compared to the case without it.…”
Section: Introduction
confidence: 99%
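
A note on the arithmetic behind the 9.6× figure above: plain bit-width reduction from FP32 to 5 bits gives only 32/5 = 6.4×, so exceeding that requires the lossless entropy coding step to exploit the skewed weight histogram. The sketch below computes the Shannon lower bound for a hypothetical, invented histogram to show how the average can drop well below 5 bits per weight.

```python
import math
from collections import Counter

def shannon_bits_per_symbol(counts):
    """Entropy H = -sum(p * log2(p)): the lossless coding lower bound."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Hypothetical, heavily skewed 5-bit weight histogram (illustrative only).
hist = Counter({0: 5000, 1: 1500, 31: 1500, 2: 700, 30: 700, 3: 300, 29: 300})
h = shannon_bits_per_symbol(hist)
print(f"entropy: {h:.2f} bits/weight")   # well below the 5-bit container
print(f"vs FP32: {32 / h:.1f}x smaller") # can exceed 32/5 = 6.4x
```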
“…Similar to Argus, SparseNN [17] and Cambricon-x [18] take advantage of skipping zeros in CNN weights. Besides those mentioned, there are many other high-quality architectures in terms of performance, such as Eyeriss v2 [12], ENVISION [18], Thinker [19], UNPU [20], Snowflake [22], Caffeine [23], CoNNa [24], and the architectures in [25]-[27].…”
Section: Introduction
confidence: 99%
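
The zero-skipping idea shared by SparseNN and Cambricon-x is easy to state in software terms: store only the nonzero weights together with their positions, and iterate over those alone. A minimal CSR-style sketch follows; it illustrates the principle only and does not model any particular accelerator's dataflow.

```python
import numpy as np

def to_csr(w):
    """Compress a weight matrix to (values, column indices, row pointers)."""
    vals, cols, rowptr = [], [], [0]
    for row in w:
        nz = np.flatnonzero(row)
        vals.extend(row[nz]); cols.extend(nz)
        rowptr.append(len(vals))
    return np.array(vals), np.array(cols), np.array(rowptr)

def spmv(vals, cols, rowptr, x):
    """Sparse matrix-vector product: MACs happen only on nonzero weights."""
    y = np.zeros(len(rowptr) - 1, dtype=x.dtype)
    for r in range(len(y)):
        for i in range(rowptr[r], rowptr[r + 1]):
            y[r] += vals[i] * x[cols[i]]  # zeros were never stored, so never touched
    return y

w = np.random.randn(64, 64) * (np.random.rand(64, 64) > 0.8)  # ~80% zero weights
x = np.random.randn(64)
vals, cols, rowptr = to_csr(w)
assert np.allclose(spmv(vals, cols, rowptr, x), w @ x)
print(f"MACs: {len(vals)} sparse vs {w.size} dense")
```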