2019
DOI: 10.1587/transinf.2018rcp0008
RNA: An Accurate Residual Network Accelerator for Quantized and Reconstructed Deep Neural Networks

Abstract: With the continuous refinement of Deep Neural Networks (DNNs), a series of deep and complex networks such as Residual Networks (ResNets) show impressive prediction accuracy in image classification tasks. Unfortunately, the structural complexity and computational cost of residual networks make hardware implementation difficult. In this paper, we present the quantized and reconstructed deep neural network (QR-DNN) technique, which first inserts batch normalization (BN) layers in the network during training, and …
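The BN-insertion step mentioned in the abstract is related to the standard technique of batch-normalization folding, in which a trained BN layer is absorbed into the preceding layer's weights before quantization. The following is a minimal sketch of that folding arithmetic in NumPy; all names are hypothetical and this is not the paper's implementation:

```python
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BN layer (gamma, beta, running mean/var) into the preceding
    linear layer's weights w and bias b, so that
    y = BN(w @ x + b) becomes y = w_folded @ x + b_folded."""
    scale = gamma / np.sqrt(var + eps)   # per-output-channel scale factor
    w_folded = w * scale[:, None]        # scale each output row of w
    b_folded = (b - mean) * scale + beta # absorb BN shift into the bias
    return w_folded, b_folded
```

After folding, inference consists of plain multiply-accumulate operations with no separate normalization step, which is what makes subsequent fixed-point quantization straightforward on hardware.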

Cited by 11 publications (3 citation statements)
References 19 publications
“…The rapid development of DNNs in recent years has made them a core enabler for a broad spectrum of application areas, such as computer vision, natural language processing, medical engineering, autonomous driving, and virtual reality. Meanwhile, the deployment of these applications is shifting from traditional cloud computing platforms (e.g., servers and supercomputers) to edge devices (e.g., mobile and handheld platforms), for which power efficiency is one of the major constraints [148][149][150][151][152][153][154]. Field Programmable Gate Arrays (FPGAs) and mobile devices, the two most popular substrates among edge devices, have established their dominance through promising energy/power efficiency and performance.…”
Section: Introduction (confidence: 99%)
“…On FPGAs, weight quantization is a natural fit. Besides storage reduction, the additional benefits include (1) the DSPs on FPGAs can support multiple multiply-and-accumulate (MAC) computations with appropriate weight (and activation) quantization, and (2) the look-up table (LUT) computing resources can support low-precision computing [150,[178][179][180][181][182][183][184][185]. Low-bit-width fixed-point quantization is achieved in [179] through a greedy search that determines the radix position for quantizing each layer, and in [182] with a hybrid quantization scheme that allows different bit-widths for weights to provide more flexibility.…”
Section: Introduction (confidence: 99%)
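The per-layer radix-position search described in this excerpt can be sketched as follows. This is a NumPy illustration of the general idea (pick the number of fractional bits that minimizes a layer's quantization error), not the exact procedure of the cited paper; all names are hypothetical:

```python
import numpy as np

def quantize_fixed_point(w, total_bits, frac_bits):
    """Uniform signed fixed-point quantization with `frac_bits`
    fractional bits out of `total_bits` total bits."""
    scale = 2.0 ** frac_bits
    qmin = -(2 ** (total_bits - 1))
    qmax = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(w * scale), qmin, qmax)  # quantize to integer grid
    return q / scale                              # dequantize for comparison

def choose_radix_greedy(w, total_bits=8):
    """Greedily pick the radix position (fractional bit count) that
    minimizes the L2 quantization error for one layer's weights."""
    errors = [np.linalg.norm(w - quantize_fixed_point(w, total_bits, f))
              for f in range(total_bits)]
    return int(np.argmin(errors))
```

Each layer gets its own radix position, so layers with small-magnitude weights can spend more bits on the fraction while layers with large-magnitude weights keep more integer bits.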