2021 18th International SoC Design Conference (ISOCC)
DOI: 10.1109/isocc53507.2021.9613997

CNN Accelerator with Minimal On-Chip Memory Based on Hierarchical Array

Cited by 9 publications (8 citation statements)
References 2 publications
“…A Hybrid Precision FP MAC (HP-MAC) unit is presented as an example implementation of an accelerator for YOLOv2-Tiny, which consists of 3 × 3 convolution kernels in all nine layers based on the diagonal cyclic array proposed by [40]. The input activations are propagated horizontally through each row, while weight parameters are propagated vertically through each column of the 3 × 3 array, as illustrated in Figure 10.…”
Section: HPFP Multiplication and Accumulation (HPFP MAC)
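The dataflow quoted above — input activations streaming horizontally through each row while weights stream vertically through each column of a 3 × 3 array — can be sketched as a small output-stationary systolic simulation. This is an illustrative assumption only: the function name, the output-stationary accumulation style, and the plain row/column shifting are not taken from the paper, and the sketch does not model the diagonal cyclic scheduling of [40].

```python
import numpy as np

def systolic_matmul(A, B, n=3):
    """Simulate an n x n output-stationary systolic array computing A @ B.

    Rows of A enter from the left edge (row i delayed by i cycles) and
    move one PE to the right per cycle; columns of B enter from the top
    edge (column j delayed by j cycles) and move one PE down per cycle.
    Each PE multiplies the pair it currently holds and accumulates
    locally, so PE (i, j) ends up with sum_k A[i, k] * B[k, j].
    """
    acc = np.zeros((n, n))            # per-PE local accumulators
    a_reg = np.full((n, n), np.nan)   # activation held by each PE (NaN = bubble)
    w_reg = np.full((n, n), np.nan)   # weight held by each PE
    for t in range(3 * n - 2):        # skewed wavefront drains in 3n - 2 cycles
        # shift: activations move right, weights move down, one PE per cycle
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        w_reg[1:, :] = w_reg[:-1, :].copy()
        # inject skewed inputs at the array edges
        for i in range(n):
            k = t - i                 # row i is delayed by i cycles
            a_reg[i, 0] = A[i, k] if 0 <= k < n else np.nan
        for j in range(n):
            k = t - j                 # column j is delayed by j cycles
            w_reg[0, j] = B[k, j] if 0 <= k < n else np.nan
        # every PE holding a valid (activation, weight) pair accumulates
        valid = ~np.isnan(a_reg) & ~np.isnan(w_reg)
        prod = a_reg * w_reg
        acc[valid] += prod[valid]
    return acc
```

With this skew, PE (i, j) sees A[i, k] and B[k, j] on the same cycle (k = t − i − j), which is why the local accumulator converges to the matrix product.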
“…including state machine control, register configuration, and address updates during continuous computing. In addition to the convolution layer, the operation core also supports activation and pooling, and the three functional modules are cascaded [5].…”
Section: Convolutional Layer Operation Analysis
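The cascade described in that statement — convolution, then activation, then pooling applied back-to-back — can be sketched minimally in NumPy. The function names, the single-channel 3 × 3 valid convolution, ReLU, and the 2 × 2 max-pool are illustrative assumptions, not details taken from [5]:

```python
import numpy as np

def conv3x3(x, w):
    """Valid 3x3 convolution over a single-channel feature map."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * w)
    return out

def relu(x):
    """Elementwise activation module."""
    return np.maximum(x, 0.0)

def maxpool2x2(x):
    """Non-overlapping 2x2 max pooling (odd trailing rows/cols dropped)."""
    H, W = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def operation_core(x, w):
    # the three functional modules cascaded: conv -> activation -> pooling
    return maxpool2x2(relu(conv3x3(x, w)))
```

In hardware the three stages would be pipelined rather than called sequentially, but the data dependence is the same: each module consumes the previous module's output stream.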
“…Next, we present two experiments to compare the inference computation times of quantized models from ONNX Runtime dynamic quantization and the proposed method. The first experiment is based on actual inference on a GPU-based PC, while the second is based on estimating computation time on an NPU architecture [48,49]. In the first experiment, the quantized YOLOv5 models are tested on a GPU-based PC with an Intel Core i7-9700 CPU @ 3.00 GHz and an NVIDIA GeForce RTX 2060 GPU (6 GB).…”
Section: CNN Model Number of Parameterized Layers
“…Table 2 demonstrates that the proposed method offers a substantially higher speed improvement for deeper and more complex CNNs. In the second experiment, we estimated the computation time based on the NPU architecture simulator reported in [48,49] using the YOLOv5-n (3) model, as shown in Table 3. Table 3 compares the NPU's estimated inference time for ONNX Runtime dynamic quantization and USPIQ.…”
Section: CNN Model Number of Parameterized Layers