2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)
DOI: 10.1109/reconfig.2016.7857144
A high-efficiency runtime reconfigurable IP for CNN acceleration on a mid-range all-programmable SoC

Cited by 21 publications (16 citation statements)
References 15 publications
“…Frequent data caching and parameter loading will be limited by memory bandwidth. Therefore, in many studies, CNN hardware structures are designed around the two bottlenecks of floating-point resources and bandwidth [12], [13], [16], [17].…”
Section: B. CNNs Implemented by FPGAs
confidence: 99%
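The bandwidth bottleneck this statement refers to can be made concrete with a back-of-the-envelope roofline check. The C sketch below compares the arithmetic intensity of one convolutional layer with full on-chip reuse against the same layer with no reuse; the layer dimensions and the 100 GFLOP/s / 4 GB/s platform figures are assumptions for illustration, not numbers from the cited papers.

```c
#include <stdio.h>

/* Back-of-the-envelope roofline check for one convolutional layer.
 * Every figure below is a hypothetical example, not from the papers. */
int main(void) {
    /* Assumed layer: 256x256 maps, 3x3 kernels, 64 in/out channels, fp32. */
    double H = 256, W = 256, K = 3, Cin = 64, Cout = 64, bpw = 4.0;

    double flops = 2.0 * H * W * K * K * Cin * Cout;   /* one MAC = 2 FLOPs */

    /* Best case: on-chip buffers large enough that every input word,
     * weight, and output word crosses the DRAM bus exactly once. */
    double bytes_full_reuse =
        bpw * (H * W * Cin + K * K * Cin * Cout + H * W * Cout);

    /* Worst case: no on-chip reuse, so every MAC refetches one input
     * word and one weight word from DRAM ("frequent data caching and
     * parameter loading"). */
    double bytes_no_reuse = bpw * 2.0 * (flops / 2.0);

    /* Assumed mid-range SoC: 100 GFLOP/s peak, 4 GB/s DRAM bandwidth. */
    double ridge = 100e9 / 4e9;   /* FLOPs/byte needed to be compute-bound */

    printf("full reuse: %6.2f FLOPs/byte (ridge %.1f)\n",
           flops / bytes_full_reuse, ridge);
    printf("no reuse:   %6.2f FLOPs/byte (ridge %.1f)\n",
           flops / bytes_no_reuse, ridge);
    return 0;
}
```

With full reuse this layer sits well above the ridge point, but without on-chip buffering the same layer falls two orders of magnitude below it, which is why these designs revolve around buffering and bandwidth.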
“…In detail, each of the DMACs serves as a bus master and accesses the DRAM subsystem through the on-chip bus. The hardware accelerator is assumed to be equipped with either one ([1], [8,9]) or multiple DMACs ([2], [5]-[7]), each of which accesses the DRAM subsystem as a bus master. For example,…”
Section: System Under Consideration
confidence: 99%
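A DMAC acting as a bus master is typically driven by the processor through a handful of memory-mapped registers. The bare-metal C sketch below shows the usual program-start-poll sequence for one DRAM-to-accelerator transfer; the base address, register offsets, and bit fields are hypothetical placeholders, not the layout of any specific DMAC in the cited works (under Linux, the register page would additionally be mmap'd rather than dereferenced directly).

```c
#include <stdint.h>

/* Hypothetical memory-mapped register layout of one DMAC channel.
 * Offsets and bit meanings are placeholders, not a real device map. */
#define DMAC_BASE   0x40400000u
#define REG_SRC     0x00u  /* DRAM source address (bus-master read) */
#define REG_LEN     0x04u  /* transfer length in bytes              */
#define REG_CTRL    0x08u  /* bit 0: start                          */
#define REG_STAT    0x0Cu  /* bit 0: done                           */

static inline void reg_write(uintptr_t base, uint32_t off, uint32_t v) {
    *(volatile uint32_t *)(base + off) = v;
}
static inline uint32_t reg_read(uintptr_t base, uint32_t off) {
    return *(volatile uint32_t *)(base + off);
}

/* Fetch one tile from DRAM into the accelerator's local buffer.
 * The DMAC, as bus master, performs the actual DRAM reads. */
void dma_fetch_tile(uint32_t dram_addr, uint32_t nbytes) {
    reg_write(DMAC_BASE, REG_SRC, dram_addr);
    reg_write(DMAC_BASE, REG_LEN, nbytes);
    reg_write(DMAC_BASE, REG_CTRL, 1u);            /* kick off transfer  */
    while ((reg_read(DMAC_BASE, REG_STAT) & 1u) == 0)
        ;                                           /* busy-wait for done */
}
```

With multiple DMACs, the same sequence is simply replicated once per channel base address, and the channels then contend for the DRAM subsystem as independent bus masters.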
“…After all, it follows that the processor core makes it possible to reconfigure the hardware accelerator according to the bank allocations and the number of DMACs. Since a hardware accelerator is usually designed as a standalone IP block, a standardized interface eases its integration into the system [1,2], [6]-[9]. The AMBA AXI4 interface, the standardized interface used in [27], is assumed in this work, as illustrated in Figure 1(a).…”
Section: On-chip / Off-chip
confidence: 99%
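Integration through a standardized interface usually means the IP block exposes an AXI4-Lite control port alongside its AXI4 data ports, so the processor core can reconfigure and start it with plain register writes. The sketch below shows one such host-side sequence; the base address, register offsets, and field names (including REG_NUM_DMACS and REG_BANK_MAP) are hypothetical, not taken from the cited paper.

```c
#include <stdint.h>

/* Hypothetical AXI4-Lite control map of the accelerator IP.
 * Base address, offsets, and bit fields are placeholders. */
#define ACC_BASE      0x43C00000u
#define ACC           ((volatile uint32_t *)ACC_BASE)
#define REG_CTRL      (0x00u / 4)  /* bit 0: start, bit 1: done (read-only) */
#define REG_NUM_DMACS (0x10u / 4)  /* number of DMAC channels to enable     */
#define REG_BANK_MAP  (0x14u / 4)  /* DRAM bank allocation bitmap           */

/* Reconfigure, start, and wait: the processor core only ever touches
 * the standardized control port, never the accelerator internals. */
void acc_run(uint32_t num_dmacs, uint32_t bank_map) {
    ACC[REG_NUM_DMACS] = num_dmacs;
    ACC[REG_BANK_MAP]  = bank_map;
    ACC[REG_CTRL]      = 1u;               /* start */
    while ((ACC[REG_CTRL] & 2u) == 0)
        ;                                   /* poll done */
}
```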
“…For embedded systems that allow the FPGA accelerator itself [10], [11] to proactively fetch data from main memory into local FPGA memory, the state of the art is still copy-based shared memory. The main memory is statically split into two sections: one exclusively accessed by the host via cached, paged virtual addressing, and a second accessed by both the host and the FPGA via uncached, contiguous physical addressing.…”
Section: Introduction
confidence: 99%
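Copy-based shared memory of this kind is commonly set up from Linux user space by mapping the reserved, physically contiguous section through /dev/mem and copying buffers into it before the FPGA is started. The sketch below illustrates that pattern; the physical base address and size of the reserved region are assumptions, as is the hypothetical helper name stage_for_fpga.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical reserved region: carved out of DRAM at boot (e.g. via a
 * reserved-memory device-tree node) so the kernel never pages or caches
 * it for normal use. Address and size are assumptions. */
#define SHARED_PHYS  0x38000000ul
#define SHARED_SIZE  (128ul * 1024 * 1024)

/* Map the FPGA-visible section and copy a host buffer into it.
 * Returns the virtual address of the copy, or NULL on failure. */
void *stage_for_fpga(const void *src, size_t len, size_t offset) {
    if (offset + len > SHARED_SIZE)
        return NULL;
    int fd = open("/dev/mem", O_RDWR | O_SYNC);    /* O_SYNC: uncached */
    if (fd < 0)
        return NULL;
    uint8_t *shared = mmap(NULL, SHARED_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, SHARED_PHYS);
    close(fd);                                      /* mapping survives */
    if (shared == MAP_FAILED)
        return NULL;
    memcpy(shared + offset, src, len);              /* the "copy" step  */
    return shared + offset;   /* FPGA addresses it as SHARED_PHYS+offset */
}
```

The copy through memcpy is exactly the overhead the statement calls "copy-based": the host works in its cached virtual section and must duplicate every buffer into the uncached physical section before the FPGA can fetch it.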