2017
DOI: 10.1145/3079758

Throughput-Optimized FPGA Accelerator for Deep Convolutional Neural Networks

Abstract: Deep convolutional neural networks (CNNs) have gained great success in various computer vision applications. State-of-the-art CNN models for large-scale applications are computation-intensive and memory-expensive and, hence, are mainly processed on high-performance processors like server CPUs and GPUs. However, there is an increasing demand for high-accuracy or real-time object detection tasks in large-scale clusters or embedded systems, which requires energy-efficient accelerators because of the green computat…

Cited by 87 publications (48 citation statements)

References 15 publications
“…Liu et al. [170] proposed a parallel framework for FPGA-based CNN accelerators that exploits four levels of parallelism: task level, layer level, loop level, and operator level. Task-level parallelism involves executing multiple image prediction tasks simultaneously.…”
Section: FPGA-Based Accelerators
confidence: 99%
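As a concrete illustration of the loop-level and operator-level parallelism this citation describes, the following is a minimal HLS-style C sketch of a tiled convolution kernel. The tile sizes Tm and Tn, the function name conv_tile, and the array names are illustrative assumptions, not taken from Liu et al. [170].

/* Minimal sketch (assumed, not the accelerator's actual code): a tiled
 * CNN convolution loop nest in HLS-style C. Loop-level parallelism comes
 * from unrolling the tile loops (too, tii); operator-level parallelism
 * comes from the resulting Tm*Tn parallel multiply-accumulate operators. */
#define Tm 4   /* output-feature-map tile size (placeholder) */
#define Tn 4   /* input-feature-map tile size (placeholder)  */

void conv_tile(float out[Tm], const float in[Tn], const float w[Tm][Tn])
{
    /* In an HLS flow these loops would carry UNROLL pragmas so that all
     * Tm*Tn multiply-accumulates can execute in the same cycle. */
    for (int too = 0; too < Tm; too++) {        /* unrolled in hardware */
        for (int tii = 0; tii < Tn; tii++) {    /* unrolled in hardware */
            out[too] += w[too][tii] * in[tii];
        }
    }
}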
“…We also like to acknowledge Dr. Blair P. Bremberg and Ms. Sumaiya Hussain Sadiq for their help in professional English editing of this manuscript.…” [The remainder of this citation statement is spilled table text listing the FPGA-based accelerators surveyed by the citing review, among them: NeuFlow [143], Memory-Centric Accelerator [146], nn-X [148], Roofline-based FPGA Accelerator [55], Embedded FPGA Accelerator [98], DeepBurning [155], OpenCL-based FPGA Accelerator [80], Caffeine [153], [162], fpgaConvNet [165], Loop Unrolling [78], [168], Throughput-Optimized FPGA Accelerator [170], FP-DNN [171], FINN [181], Customized CONV Loop Accelerator [83], Latency-Driven Design for FPGA-based CNNs [183], DLA [188], Winograd-based CNN Accelerator [189], OpenCL-based Architecture for Accelerating CNNs [190], Multi-CLP Accelerator for CNNs [192], Automated Systolic Array Architecture for CNN [195], End-to-End Scalable FPGA Accelerator [196], DLAU [197], An Automatic RTL Compiler for High-Throughput Deep CNNs [199], Intel's DLA [200], Angel-Eye [60], and Optimizing the CONV Operation to Accelerate DNNs on FPGA [204].]
Section: Acknowledgment
confidence: 99%
“…DNN Accelerator Performance Prediction. For designing FPGA-based DNN accelerators, current practice usually relies on roofline models [10] or customized analytical tools [13,16] to estimate the achievable performance. For ASIC-based accelerators, recently published designs [21,34,35] introduce various performance prediction methods.…”
Section: Background and Related Work
confidence: 99%
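The roofline estimate mentioned in this citation reduces to a one-line bound: attainable throughput is the minimum of peak compute and memory bandwidth times operational intensity. The C sketch below evaluates that bound for a hypothetical accelerator; all numbers are placeholders, not figures from the cited designs [10, 13, 16].

#include <stdio.h>

/* Minimal roofline sketch: attainable throughput is bounded by the
 * smaller of peak compute and (memory bandwidth * operational intensity,
 * i.e. FLOPs per byte moved). All numbers below are placeholders. */
static double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                              double ops_per_byte)
{
    double memory_bound = bandwidth_gbs * ops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void)
{
    /* Hypothetical accelerator: 500 GFLOP/s peak, 10 GB/s DRAM bandwidth. */
    for (double oi = 5.0; oi <= 80.0; oi *= 2.0)
        printf("OI = %5.1f FLOP/B -> %6.1f GFLOP/s attainable\n",
               oi, roofline_gflops(500.0, 10.0, oi));
    return 0;
}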
“…Recently, field-programmable gate arrays (FPGAs) have become a particularly attractive option for accelerating large-scale matrix multiplication due to their reconfigurability and abundant logic resources. Previous studies [3][4][5][6][7][8][9] have primarily focused on accelerating matrix multiplication on FPGA by using an efficient architecture, i.e., the one-dimensional systolic array.…”
Section: Introduction
confidence: 99%
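For readers unfamiliar with the one-dimensional systolic array referenced here, the following is a small functional C model of the dataflow: each processing element holds one column of B and accumulates one output per streamed row of A. It is a behavioral sketch under assumed naming, not the RTL of any cited design [3]-[9], and it ignores the cycle-level skewing a real array requires.

#include <stdio.h>

#define N 3  /* toy size; real designs use much larger arrays */

/* Software model of a 1-D systolic array computing C = A * B.
 * PE j locally holds column j of B; rows of A are streamed through the
 * chain of PEs, and each PE accumulates one output element per row.
 * This models only the dataflow, not timing. */
void systolic_1d_matmul(const int A[N][N], const int B[N][N], int C[N][N])
{
    for (int i = 0; i < N; i++) {              /* stream one row of A per pass */
        int acc[N] = {0};                      /* one accumulator per PE       */
        for (int k = 0; k < N; k++)            /* element A[i][k] enters array */
            for (int pe = 0; pe < N; pe++)     /* each PE multiplies against   */
                acc[pe] += A[i][k] * B[k][pe]; /* its stored column of B       */
        for (int pe = 0; pe < N; pe++)
            C[i][pe] = acc[pe];
    }
}

int main(void)
{
    int A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    int B[N][N] = {{1,0,0},{0,1,0},{0,0,1}};
    int C[N][N];
    systolic_1d_matmul(A, B, C);
    for (int i = 0; i < N; i++)
        printf("%d %d %d\n", C[i][0], C[i][1], C[i][2]);
    return 0;
}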