Comparative Analysis of Processor-FPGA Communication Performance in Low-Cost FPSoCs

2022

In this paper, a new pre-RTL simulator is proposed to predict the power, performance, and area of convolutional neural network (CNN) dataflows prior to register-transfer-level (RTL) design. In the simulator, a novel approach is adopted to implement a spatial data dependence graph (SDDG), which enables us to model a specific dataflow alongside inter-instruction dependencies by tracking the status of each processing element (PE). In addition, the proposed pre-RTL simulator makes it possible to evaluate the impact of memory constraints such as latency and bandwidth. The latency-insensitive and bandwidth-insensitive PE controllers assumed in the proposed pre-RTL simulator guarantee both operational correctness and maximum performance, regardless of memory constraints. In particular, it is shown that the optimal distribution method of local memory bandwidth can reduce the accelerator execution time by up to 37.6% compared with the equal distribution method. For weight stationary (WS) and row stationary (RS) dataflows, the accelerator performance closely depends on memory constraints. The simulation results also show that the relative performance of dataflows depends on the shape of the convolutional layer. For example, for an identical hardware area in a standard convolutional layer of AlexNet, WS dataflows do not provide any performance gain over RS dataflows when the memory latency is sufficiently high. In addition, WS dataflows cannot fully reuse the input activation, thereby increasing local memory accesses, since the number of weights loaded at a specific time is limited. Moreover, in a depth-wise convolutional layer of MobileNet, WS dataflows tend to outperform RS dataflows even in the presence of large memory latency.

Section: System Under Consideration

mentioning

confidence: 99%

Spatial Data Dependence Graph Based Pre-RTL Simulator for Convolutional Neural Network Dataflows

Wang

2022

“…It is assumed that the whole computation-intensive task (e.g., the convolutional layer in the case of CNN) is offloaded onto the accelerator and that the processor core is responsible for synchronizing each of the DMACs, as assumed in many other studies, e.g., [30][31][32]. More specifically, the processor can start and stop the DMAC execution appropriately by reading from or writing to the control registers of the accelerator, for example, through the AMBA AXI interface.…”

Section: System Under Consideration

mentioning

confidence: 99%

“…If the communication time is longer than the computation time (as assumed in the figure), it is referred to as communication-limited. In the communication-limited case, the performance of a DMA-controlled accelerator tends to be determined by the communication bandwidth, which is in turn determined by DRAM latency and bus protocol overhead. It is assumed that the whole computation-intensive task (e.g., the convolutional layer in the case of CNN) is offloaded onto the accelerator and that the processor core is responsible for synchronizing each of the DMACs, as assumed in many other studies, e.g., [30][31][32]. More specifically, the processor can start and stop the DMAC execution appropriately by reading from or writing to the control registers of the accelerator, for example, through the AMBA AXI interface.…”

mentioning

confidence: 99%
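The communication-limited condition quoted above can be illustrated with a minimal sketch. It is not the estimation algorithm of the cited papers: it assumes a simple per-tile model in which communication time is bytes moved divided by a fixed effective bandwidth, and it assumes transfers overlap with computation (double buffering), ignoring pipeline fill and drain. The names `TileWorkload`, `classify`, and `overlapped_cycles` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TileWorkload:
    """One tile of an offloaded task (hypothetical model, not from the source)."""
    bytes_moved: int      # bytes transferred between DRAM and local memory
    compute_cycles: int   # accelerator cycles needed to process the tile


def comm_cycles(tile: TileWorkload, bandwidth_bytes_per_cycle: float) -> float:
    """Communication time under the assumed fixed effective bandwidth."""
    return tile.bytes_moved / bandwidth_bytes_per_cycle


def classify(tile: TileWorkload, bandwidth_bytes_per_cycle: float) -> str:
    """Communication-limited if moving the data takes longer than computing on it."""
    if comm_cycles(tile, bandwidth_bytes_per_cycle) > tile.compute_cycles:
        return "communication-limited"
    return "computation-limited"


def overlapped_cycles(tile: TileWorkload, bandwidth_bytes_per_cycle: float,
                      num_tiles: int) -> float:
    """Steady-state total time, assuming transfers fully overlap with computation."""
    per_tile = max(comm_cycles(tile, bandwidth_bytes_per_cycle), tile.compute_cycles)
    return num_tiles * per_tile


tile = TileWorkload(bytes_moved=4096, compute_cycles=512)
print(classify(tile, 4.0))    # 1024 communication cycles vs. 512 compute cycles
print(classify(tile, 16.0))   # 256 communication cycles vs. 512 compute cycles
```

In this simplified model, raising the effective bandwidth only helps until the tile becomes computation-limited; beyond that point the compute time sets the total.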

System-Level Communication Performance Estimation for DMA-Controlled Accelerators

Kim

2021

The performance of a hardware accelerator is often limited by the communication bandwidth between local on-chip memories and DRAM across the on-chip bus. In this paper, a system-level performance estimation algorithm is proposed for evaluating the communication performance of direct memory access (DMA) controlled accelerators. The proposed algorithm can estimate the communication performance accurately for both DRAM-limited and bus-limited cases. In detail, the communication performance

“…Processor core In addition, the processor core is responsible for synchronizing the DMACs in the hardware accelerator, i.e., starting and stopping the DMAC execution appropriately [24][25][26]. Recalling that the processor core is also responsible for setting the source/destination addresses of each DMAC, the bank allocations can also be reconfigured by letting the processor core allocate the DMAC a set of banks.…”

Section: On-chip Off-chip

mentioning

confidence: 99%
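The role of the processor core described in the statement above (setting source/destination addresses, thereby choosing a DRAM bank, and starting or stopping each DMAC through control registers) can be sketched in software. Everything here is an assumption for illustration: the register offsets, the control bit values, the contiguous (non-interleaved) bank layout, and the names `DmacModel`, `bank_base`, `assign_bank_and_start`, and `stop` are hypothetical and do not describe any particular DMAC or the cited papers' hardware.

```python
from dataclasses import dataclass, field

# Hypothetical register offsets and control values; a real DMAC defines its own map.
REG_SRC, REG_DST, REG_LEN, REG_CTRL = 0x00, 0x04, 0x08, 0x0C
CTRL_START, CTRL_STOP = 0x1, 0x2


@dataclass
class DmacModel:
    """Software stand-in for one DMAC's memory-mapped control registers."""
    regs: dict = field(default_factory=dict)

    def write(self, offset: int, value: int) -> None:
        self.regs[offset] = value

    def read(self, offset: int) -> int:
        return self.regs.get(offset, 0)


def bank_base(bank: int, bank_size: int) -> int:
    """Base address of a DRAM bank, assuming banks are laid out contiguously."""
    return bank * bank_size


def assign_bank_and_start(dmac: DmacModel, bank: int, bank_size: int,
                          local_addr: int, length: int) -> None:
    """Point the DMAC at a bank via its source address, then start the transfer."""
    dmac.write(REG_SRC, bank_base(bank, bank_size))
    dmac.write(REG_DST, local_addr)
    dmac.write(REG_LEN, length)
    dmac.write(REG_CTRL, CTRL_START)


def stop(dmac: DmacModel) -> None:
    """Request that the DMAC stop executing."""
    dmac.write(REG_CTRL, CTRL_STOP)


# Two DMACs allocated to different banks so their DRAM accesses do not share a bank.
dmac0, dmac1 = DmacModel(), DmacModel()
assign_bank_and_start(dmac0, bank=0, bank_size=1 << 20, local_addr=0x1000, length=256)
assign_bank_and_start(dmac1, bank=2, bank_size=1 << 20, local_addr=0x2000, length=256)
stop(dmac0)
```

Reconfiguring the bank allocation in this sketch amounts to calling `assign_bank_and_start` again with a different `bank` argument, which mirrors the statement that the allocation follows from the addresses the processor writes.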

Optimization of Communication Schemes for DMA-Controlled Accelerators

Wang

2021

The hardware accelerator controlled by direct memory access (DMA) is greatly influenced by the communication bandwidth from/to DRAM through on-chip buses. This paper proposes a novel performance estimation algorithm to optimize the communication schemes (CSs), which are defined by the number of direct memory access controllers (DMACs) and the bank allocation of DRAM. In order to facilitate the optimization of CSs, a communication primitive (CP) is defined by the bank allocation and the set of activated DMACs. By using the communication bandwidths of CPs obtained from prior full-system simulations, the proposed performance estimation algorithm can predict the communication performance of CSs more accurately than the conventional performance estimation algorithms. When it is applied to convolutional neural networks (CNNs) and wireless communications (LDPC-coded MIMO-OFDM), the estimation error is measured to be no more than 6.4% and 5%, respectively. In addition, compared with the conventional simulation-based approaches, the proposed estimation algorithm provides a speedup of two orders of magnitude. The proposed performance estimation algorithm is used to optimize the CS of the CNNs and explore a design space characterized by bank interleaving, outstanding transactions, layer shape, tile size, and hardware frequency. It is shown that the optimized CS improves communication performance by up to 68% for the third convolutional layer of AlexNet and 60% for the MIMO of LDPC-coded MIMO-OFDM. In addition, the DRAM latency is minimized by setting the bank interleaving to the number of outstanding transactions. Moreover, the simulation results show that the optimum CS depends on the application. It is also shown that the use of an extra DMAC does not necessarily improve the communication performance.

INDEX TERMS: Convolutional neural networks, direct memory access, hardware accelerator, off-chip DRAM, on-chip communication, wireless communications.