2018 IEEE International Solid-State Circuits Conference (ISSCC)
DOI: 10.1109/isscc.2018.8310261
QUEST: A 7.49TOPS multi-purpose log-quantized DNN inference engine stacked on 96MB 3D SRAM using inductive-coupling technology in 40nm CMOS

Citations: cited by 76 publications (43 citation statements)
References: 3 publications
“…State-of-the-art silicon prototypes such as QUEST [43] or UNPU [44] exploit such strong quantization and voltage scaling, and have demonstrated correspondingly high measured energy efficiency on their devices. The UNPU reaches an energy efficiency of 50.6 TOp/s/W at a throughput of 184 GOp/s with 1-bit weights and 16-bit activations on 16 mm² of silicon in 65 nm technology.…”
Section: FPGA and ASIC Accelerators
confidence: 99%
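The log quantization in QUEST's title is what makes such aggressive efficiency reachable: with weights rounded to signed powers of two, every multiply in a MAC collapses into a shift of the activation. The sketch below is a minimal Python illustration of that idea; the function names and the 4-bit exponent width are assumptions made for the example, not details taken from the QUEST or UNPU papers.

```python
# Illustrative sketch (assumed details, not from the QUEST paper): log
# quantization stores a weight as (sign, exponent), with |w| rounded to
# the nearest power of two, so multiplication becomes a power-of-two scaling.
import math

def log_quantize(w: float, exp_bits: int = 4):
    """Quantize |w| to the nearest power of two; returns (sign, exponent)."""
    if w == 0.0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    # Clamp to the representable exponent range, e.g. [-8, 7] for 4 bits.
    lo, hi = -(1 << (exp_bits - 1)), (1 << (exp_bits - 1)) - 1
    return sign, max(lo, min(hi, exp))

def log_mac(activations, weights, exp_bits: int = 4) -> float:
    """Multiply-accumulate where each multiply is a power-of-two scaling."""
    acc = 0.0
    for a, w in zip(activations, weights):
        sign, exp = log_quantize(w, exp_bits)
        # In hardware this line is an arithmetic shift of the fixed-point
        # activation, not a full multiplier.
        acc += sign * a * (2.0 ** exp)
    return acc

print(log_mac([0.5, -1.25, 2.0], [0.9, 0.26, -1.1]))
```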
“…al. [5] and the QUEST log-quantized 3D-stacked inference engine by Ueyoshi et al. [6]. Indeed, bit-serial operand feeding implicitly allows fully variable bit precision.…”
Section: Bit-serial Designs
confidence: 99%
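To make the variable-precision point concrete, here is a minimal Python sketch of bit-serial operand feeding: the weight is consumed one bit per cycle, so the same accumulator datapath supports any precision simply by iterating over more or fewer bit positions. The function name and operand widths are illustrative assumptions, not taken from [5] or [6].

```python
# Illustrative sketch (assumed, not the datapath of [5] or [6]): weights are
# fed one bit per cycle, LSB first, so precision is just the cycle count.
def bit_serial_dot(activations, weights, n_bits: int) -> int:
    """Dot product with n_bits-wide unsigned weights fed bit-serially."""
    acc = 0
    for bit in range(n_bits):               # one "cycle" per weight bit
        partial = sum(a for a, w in zip(activations, weights)
                      if (w >> bit) & 1)    # add activations where bit is set
        acc += partial << bit               # weight the partial sum by 2^bit
    return acc

# The same loop handles any precision; here, 4-bit weights:
acts = [3, 1, 4]
ws4 = [5, 2, 7]
assert bit_serial_dot(acts, ws4, 4) == sum(a * w for a, w in zip(acts, ws4))
```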
“…This has led to a new trend of precision-scalable neural processors that minimize energy at a target performance without giving up flexibility. Recent papers have introduced runtime-configurable MAC architectures optimized for deep learning, built either with high parallelization capabilities [3], [4] or with bit-serial approaches [5], [6].…”
Section: Introduction
confidence: 99%
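For the parallelization-based alternative mentioned above, one common ingredient (sketched here as an assumption, not the specific design of [3] or [4]) is a decomposable multiplier: a wide multiply is assembled from several narrow partial products, so at low precision the same narrow multipliers can be regrouped into multiple independent MACs for higher throughput.

```python
# Illustrative sketch (assumed technique, not the architecture of [3] or [4]):
# one 8x8 unsigned multiply rebuilt from four 4x4 partial products. At 4-bit
# precision the four small multipliers could instead serve four parallel MACs.
def mul8_from_4x4(a: int, b: int) -> int:
    """8x8 unsigned multiply composed of four 4x4 partial products."""
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    return ((a_hi * b_hi) << 8) + ((a_hi * b_lo) << 4) \
         + ((a_lo * b_hi) << 4) + (a_lo * b_lo)

assert mul8_from_4x4(173, 94) == 173 * 94
```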
“…6(a)) forms a primitive binary neural network accelerator based on the typical output-input-channel parallelism, where each PE row corresponds to an input channel and each PE column corresponds to an output channel. This is a binary-only subset of a single core of the architecture proposed in [4]; the weights and inputs/outputs are all 1-bit. In this configuration, an input activation is shared among multiple output channels (i.e.…”
Section: FPGA Implementation
confidence: 99%
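The arithmetic each PE performs in such a binary configuration is conventionally an XNOR-popcount dot product; the sketch below shows that reduction in Python under the usual {-1, +1} encoding (bit 1 → +1, bit 0 → −1). This is the standard binary-network trick, not code from [4].

```python
# Illustrative sketch (standard binary-NN arithmetic, not code from [4]):
# with length-n vectors over {-1,+1} packed as bit masks, the dot product
# reduces to an XNOR (agreement test) followed by a popcount.
def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two packed {-1,+1} vectors of length n."""
    agree = bin(~(x_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * agree - n   # +1 per agreeing position, -1 per disagreeing one

# x = (+1, +1, -1, +1) packed LSB-first as 0b1011,
# w = (+1, -1, +1, +1) packed LSB-first as 0b1101
print(binary_dot(0b1011, 0b1101, 4))   # prints 0: 2 agreements, 2 disagreements
```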