2018 IEEE International Solid-State Circuits Conference (ISSCC)
DOI: 10.1109/isscc.2018.8310262
UNPU: A 50.6TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision

Cited by 268 publications (111 citation statements). References 4 publications.
“…ASICs [27], [28], [33], [ Helium, an ISA extension tailored for DSP-oriented workloads, such as an inference task. However, such an extension is not supported yet by any device.…”
Section: Performance, Energy Efficiency, Power Budget, Flexibility
confidence: 99%
“…Both implementations in [22] and [20] have a higher power efficiency than Nullhop, but provide consistently lower performance (<350 GOp/s) using more MAC units. They also require a larger area (16 mm²), but this is justified by their support for recurrent neural networks and variable bit precision.…”
Section: Memory Power Consumption Estimation
confidence: 99%
“…State-of-the-art silicon prototypes such as QUEST [43] or UNPU [44] exploit such strong quantization and voltage scaling, and have been able to measure such high energy efficiency on their devices. The UNPU reaches an energy efficiency of 50.6 TOp/s/W at a throughput of 184 GOp/s with 1-bit weights and 16-bit activations on 16 mm² of silicon in 65 nm technology.…”
Section: FPGA and ASIC Accelerators
confidence: 99%
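The quoted operating point implies a very small power envelope, since an efficiency in Op/s/W is the same as Op/J. A minimal sketch of that back-of-the-envelope check, using only the figures quoted above (the function name is an illustrative assumption, not from the paper):

```python
def implied_power_watts(throughput_ops_per_s: float,
                        efficiency_ops_per_joule: float) -> float:
    """Power (W) implied by a throughput and an energy-efficiency figure.

    Efficiency in Op/s/W equals Op/J, so P = throughput / efficiency.
    """
    return throughput_ops_per_s / efficiency_ops_per_joule

# UNPU's reported 1-bit-weight operating point:
# 184 GOp/s at 50.6 TOp/s/W.
power_w = implied_power_watts(184e9, 50.6e12)
print(f"{power_w * 1e3:.2f} mW")  # roughly 3.6 mW
```

This is consistent with the milliwatt-scale power budgets these citing papers compare against.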
“…Hyperdrive not only exploits the advantages of reduced weight-memory requirements and computational complexity, but fundamentally differs from previous BWN accelerators [26,44,45]. The main concepts can be summarized as: 1) feature maps are stored entirely on-chip, while the weights are streamed to the chip (i.e., feature-map stationary).…”
Section: FPGA and ASIC Accelerators
confidence: 99%