2018 28th International Conference on Field Programmable Logic and Applications (FPL)
DOI: 10.1109/fpl.2018.00035
Design Flow of Accelerating Hybrid Extremely Low Bit-Width Neural Network in Embedded FPGA

Abstract: Neural network accelerators with low latency and low energy consumption are desirable for edge computing. To create such accelerators, we propose a design flow for accelerating the extremely low bit-width neural network (ELB-NN) in embedded FPGAs with hybrid quantization schemes. This flow covers both network training and FPGA-based network deployment, which facilitates the design space exploration and simplifies the tradeoff between network accuracy and computation efficiency. Using this flow helps hardware d…
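The abstract's central idea is a hybrid quantization scheme: different layers of the network are assigned different, extremely low bit-widths. A minimal sketch of what such a per-layer scheme might look like is given below; the layer names, bit-widths, and the uniform/binary quantizers are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Hypothetical per-layer weight bit-widths: wider precision at the network
# boundary, extremely low bit-widths in the middle (an assumed example).
HYBRID_SCHEME = {
    "conv1": 8,   # first layer kept at higher precision
    "conv2": 2,
    "conv3": 1,   # binary weights
    "fc":    8,   # last layer kept at higher precision
}

def quantize_weights(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization to `bits` bits; 1-bit falls back to
    sign * mean-magnitude binarization (a common BNN-style choice)."""
    if bits == 1:
        scale = np.abs(w).mean()
        return np.where(w >= 0, scale, -scale)
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(w).max()), 1e-12) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

quantized = {name: quantize_weights(np.random.randn(64, 64), bits)
             for name, bits in HYBRID_SCHEME.items()}
```

During design space exploration, each entry of such a map can be swept independently to trade accuracy against the resources and latency of the corresponding FPGA compute unit.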

Cited by 90 publications (62 citation statements)
References: 16 publications
“…Our design achieves a per-image latency of 2.28 ms, which is among the lowest across all the designs. In addition, compared with some of the most recent works [40,47], our design outperforms them by 5.64× and 3.26×, respectively, in terms of energy efficiency. Additionally, compared to an implementation that achieves comparably low latency [29], our implementation has 9.29× higher energy efficiency.…”
Section: Comparing To Prior FPGA Accelerators (mentioning)
confidence: 73%
“…Our model in Table 2 is significantly smaller and all weights (including weights in batch normalization layers) are quantized to power-of-two numbers. Our accuracy is 50.84% (about 2% worse than the nearest competitive designs [40] in terms of energy efficiency). However, our implementation has at least 3× higher energy efficiency.…”
Section: Comparing To Prior FPGA Accelerators (mentioning)
confidence: 76%
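The power-of-two weight quantization mentioned above replaces multiplications with shifts in hardware, which is where the energy-efficiency advantage comes from. A rough sketch of such a quantizer is shown below; the exponent range and the log-domain rounding are assumptions, not the cited design's exact scheme.

```python
import numpy as np

def quantize_power_of_two(w: np.ndarray, min_exp: int = -7, max_exp: int = 0) -> np.ndarray:
    """Map each weight to the nearest signed power of two (nearest in the
    log domain), clamping the exponent to [min_exp, max_exp]; zeros stay zero."""
    sign = np.sign(w)
    mag = np.where(np.abs(w) > 0, np.abs(w), 2.0 ** min_exp)  # avoid log2(0)
    exp = np.clip(np.round(np.log2(mag)), min_exp, max_exp)
    return sign * 2.0 ** exp
```

Because a multiply by 2^k is a bit shift, quantizing the batch-normalization scales the same way keeps the datapath multiplier-free.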
“…In [49], the optimized solution of a network is chosen layer by layer to avoid an exponential design space exploration. Wang et al. [64] try to use a large bit-width for only the first and last layers and quantize the middle layers to ternary or binary. The method needs to increase the network size to maintain high accuracy but still brings hardware performance improvements.…”
Section: Data Quantization (mentioning)
confidence: 99%
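For the middle layers that the quote above describes as ternary or binary, a standard ternarization (in the style of ternary weight networks) looks roughly like the following; the 0.7·mean|w| threshold is a common heuristic and an assumption here, not necessarily what Wang et al. [64] use.

```python
import numpy as np

def ternarize(w: np.ndarray, thresh_ratio: float = 0.7) -> np.ndarray:
    """Ternary weight quantization sketch: map weights to {-s, 0, +s}."""
    delta = thresh_ratio * np.abs(w).mean()   # magnitude threshold
    mask = np.abs(w) > delta                  # weights that stay nonzero
    scale = np.abs(w[mask]).mean() if mask.any() else 0.0
    return np.sign(w) * mask * scale
```

Binarization is the special case with no zero level, which is consistent with the quote's observation that the network size must grow to recover the accuracy lost in the middle layers.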
“…The whole task requires designers to have a deep understanding of both DNN algorithms and hardware design. In response to the intense demands and challenges of designing DNN accelerators, we have seen rapid development of high-level synthesis (HLS) design flow [22][23][24][25] and DNN design automation frameworks [16,[26][27][28][29][30] that improve the hardware design efficiency by allowing DNN accelerator design from high-level algorithmic descriptions and using pre-defined high-quality hardware IPs. Still, they either rely on hardware experts to trim down the large design space (e.g., use pre-defined/fixed architecture templates and explore other factors [16,29]) or conduct merely limited design exploration and optimization, hindering the development of optimal DNN accelerators that can be deployed into various platforms.…”
Section: Introduction (mentioning)
confidence: 99%