Scalable high-performance architecture for convolutional ternary neural networks on FPGA

Prost-Boucle, Adrien; Bourge, Alban; Pétrot, Frédéric; Alemdar, Hande; Caldwell, Nicholas; Leroy, Vincent

doi:10.23919/fpl.2017.8056850

Cited by 61 publications

(54 citation statements)

References 16 publications

“…For example, Qiu et al. [22] proposed a CNN accelerator supporting 8- and 4-bit data, implemented on a Xilinx Zynq platform. Following this trail, even extreme quantization approaches have been presented, exploiting ternary or binary networks [23], [24]. While most DSP-capable FPGAs currently do not offer a low enough power envelope to be used in IoT end-nodes, Lattice recently announced the SenseAI class of FPGAs [25], providing comprehensive hardware and software solutions for always-on artificial intelligence (AI) within a power budget between 1 mW and 1 W. However, these ultra-low-power FPGAs are currently too expensive for many applications where MCUs are traditionally chosen because of their low cost.…”

Section: Related Work (mentioning)

confidence: 99%

PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors

Garofalo, Rusci, Conti, et al. 2019

Phil. Trans. R. Soc. A.

We present PULP-NN, an optimized computing library for a parallel ultra-low-power tightly coupled cluster of RISC-V processors. The key innovation in PULP-NN is a set of kernels for quantized neural network inference, targeting byte and sub-byte data types, down to INT-1, tuned for the recent trend toward aggressive quantization in deep neural network inference. The proposed library exploits both the digital signal processing extensions available in the PULP RISC-V processors and the cluster’s parallelism, achieving up to 15.5 MACs/cycle on INT-8 and improving performance by up to 63× with respect to a sequential implementation on a single RISC-V core implementing the baseline RV32IMC ISA. Using PULP-NN, a CIFAR-10 network on an octa-core cluster runs in 30× and 19.6× fewer clock cycles than the current state-of-the-art ARM CMSIS-NN library, running on STM32L4 and STM32H7 MCUs, respectively. The proposed library, when running on a GAP-8 processor, outperforms by 36.8× and by 7.45× the execution on energy-efficient MCUs such as STM32L4 and high-end MCUs such as STM32H7, respectively, when operating at the maximum frequency. The energy efficiency on GAP-8 is 14.1× higher than STM32L4 and 39.5× higher than STM32H7, at the maximum efficiency operating point. This article is part of the theme issue ‘Harmonizing energy-autonomous computing and intelligence’.

“…The method proposed by Prost-Boucle et al [Prost-Boucle et al 2017] achieves the previously best reported low precision throughput on an FPGA for CIFAR10. This work implements a VGG-7 style network with ternary weights and activations.…”

Section: Comparison With Previous Work (mentioning)

confidence: 90%

“…Prost-Boucle et al [Prost-Boucle et al 2017]. This paper adopts the scheme used by these authors to buffer the pixels, so the entire set of inputs is available simultaneously.…”

Section: Buffering (mentioning)

confidence: 99%

Unrolling Ternary Neural Networks

Tridgell, Kumm, Hardieck, et al. 2019

ACM Trans. Reconfigurable Technol. Syst.

The computational complexity of neural networks for large scale or real-time applications necessitates hardware acceleration. Most approaches assume that the network architecture and parameters are unknown at design time, permitting usage in a large number of applications. This paper demonstrates, for the case where the neural network architecture and ternary weight values are known a priori, that extremely high throughput implementations of neural network inference can be made by customising the datapath and routing to remove unnecessary computations and data movement. This approach is ideally suited to FPGA implementations as a specialized implementation of a trained network improves efficiency while still retaining generality with the reconfigurability of an FPGA. A VGG style network with ternary weights and fixed point activations is implemented for the CIFAR10 dataset on Amazon's AWS F1 instance. This paper demonstrates how to remove 90% of the operations in convolutional layers by exploiting sparsity and compile-time optimizations. The implementation in hardware achieves 90.9 ± 0.1% accuracy and 122 k frames per second, with a latency of only 29 µs, which is the fastest CNN inference implementation reported so far on an FPGA.

“…Prost-Boucle et al implemented ternary CNNs on a Xilinx Virtex-7 VC709 FPGA, presenting both high-performance- and low-power-targeting designs [110]. Their experiments with the CNV model classifying CIFAR-10 demonstrated a 6.6 pp accuracy improvement compared to FINN's binarised inference.…”

Section: Binarisation and Ternarisation (mentioning)

confidence: 99%

Deep Neural Network Approximation for Custom Hardware

et al. 2019

Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense and computationally expensive networks into small, sparse and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field.
