2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
DOI: 10.1109/isca45697.2020.00086
DRQ: Dynamic Region-based Quantization for Deep Neural Network Acceleration

Cited by 71 publications (30 citation statements)
References 24 publications
“…(2) Avoid Top-k Selection: Another problem is how to avoid top-k selection. Instead of sorting all the scores, we can use mean-filtering [58] to search for the important scores. Specifically, in each round, we estimate each row's mean value and only select the query-key pairs whose scores are greater than the mean value.…”
Section: A. Top-k Pruning: A Baseline
confidence: 99%
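The excerpt describes a simple heuristic: rather than sorting every row of attention scores to take the top-k, keep each query-key pair whose score exceeds its row's mean. A minimal sketch of that mean-filtering step (the function name and NumPy implementation are illustrative, not taken from the cited paper's code):

```python
import numpy as np

def mean_filter_select(scores):
    """Select important query-key pairs without a full top-k sort.

    For each row of the score matrix, compute the row mean and keep
    only the entries strictly greater than it, as the mean-filtering
    heuristic in the excerpt suggests. Hypothetical helper name.
    """
    row_means = scores.mean(axis=1, keepdims=True)  # one mean per row
    return scores > row_means                       # boolean mask of kept pairs

# Usage: each row keeps only its above-average scores.
scores = np.array([[0.1, 0.9, 0.2, 0.8],
                   [0.5, 0.5, 0.5, 0.5]])
mask = mean_filter_select(scores)
```

Note that mean filtering is O(n) per row versus O(n log n) for sorting, at the cost of not guaranteeing exactly k selections per row (the second row above, with all-equal scores, selects nothing).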
“…Thanks to recent advances in DNN compression algorithms [6,11,18,27], the parameters of a DNN can be converted from 32-bit floating point to extremely low bit-widths (e.g., < 4 bits) with negligible inference accuracy degradation, which significantly simplifies the computation and mitigates the on-/off-chip data access bottleneck (aka the "memory wall") [32,36].…”
Section: Introduction
confidence: 99%
“…To alleviate these performance problems, a number of studies have accelerated DNN implementations by designing hardware-accelerated intelligent computing architectures for sensing systems. Some studies exploit properties of DNNs to reduce latency through the parallelism of specialized acceleration circuits, such as [8][9][10][11][12][13][14]. Yet these works ignore that the total power consumption can exceed the budget.…”
Section: Introduction
confidence: 99%