2021 Design, Automation & Test in Europe Conference & Exhibition (DATE)
DOI: 10.23919/date51398.2021.9474043
Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing

Abstract: BERT is the most recent Transformer-based model that achieves state-of-the-art performance in various NLP tasks. In this paper, we investigate the hardware acceleration of BERT on FPGA for edge computing. To tackle the issue of huge computational complexity and memory footprint, we propose to fully quantize the BERT (FQ-BERT), including weights, activations, softmax, layer normalization, and all the intermediate results. Experiments demonstrate that the FQ-BERT can achieve 7.94× compression for weights with neg…
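The abstract describes quantizing every tensor in the model but does not spell out the quantization scheme itself. As a minimal, non-authoritative sketch, the Python snippet below shows symmetric per-tensor fixed-point quantization with hypothetical bit widths (4-bit weights, 8-bit activations); the bit widths, scales, and tensor shapes are illustrative assumptions, not values taken from the paper. The point is that once weights and activations are integers, the dominant matrix multiplications can run in integer arithmetic with a single rescale at the end, which is what makes an FPGA implementation attractive.

    import numpy as np

    def quantize_symmetric(x, num_bits):
        """Symmetric per-tensor quantization to signed integers (illustrative only)."""
        qmax = 2 ** (num_bits - 1) - 1                     # 7 for 4-bit, 127 for 8-bit
        scale = max(float(np.abs(x).max()), 1e-8) / qmax   # per-tensor scale factor
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        """Recover an approximate float tensor from integers and their scale."""
        return q.astype(np.float32) * scale

    # Hypothetical bit widths: 4-bit weights, 8-bit activations.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((768, 768)).astype(np.float32)   # a BERT-sized weight matrix
    a = rng.standard_normal((128, 768)).astype(np.float32)   # activations for 128 tokens

    qw, sw = quantize_symmetric(w, num_bits=4)
    qa, sa = quantize_symmetric(a, num_bits=8)

    # Integer matmul with one rescale at the end, as a fixed-point accelerator would do.
    y_int = qa.astype(np.int32) @ qw.astype(np.int32)
    y = y_int.astype(np.float32) * (sa * sw)

    print("mean abs error vs. float matmul:", float(np.abs(y - a @ w).mean()))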


Cited by 26 publications (6 citation statements) · References 6 publications
“…Further, a more fine-grained mapping scheme and more universal modules allow our DSP slice utilization to be lower than that of other designs. As can be seen from Table 6, our throughput per DSP is more than 3.59× that of [8] and 21.07× that of the FQ-BERT in [19], much better than existing accelerators. This further illustrates the advantages of EFA-Trans in terms of resource utilization.…”
Section: Performance Comparison
confidence: 86%
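The “×” figures in the comparison above come from normalizing throughput by the number of DSP slices consumed. The snippet below shows that normalization with placeholder numbers; the GOPS values and DSP counts are invented for illustration and are not the figures reported by EFA-Trans, [8], or FQ-BERT [19].

    # Throughput-per-DSP normalization used in such comparisons.
    # The numbers below are placeholders, NOT values reported by the cited works.
    accelerators = {
        "design_A": {"throughput_gops": 1000.0, "dsp_slices": 2000},   # hypothetical
        "design_B": {"throughput_gops": 400.0,  "dsp_slices": 3200},   # hypothetical
    }

    per_dsp = {name: s["throughput_gops"] / s["dsp_slices"] for name, s in accelerators.items()}
    for name, value in per_dsp.items():
        print(f"{name}: {value:.3f} GOPS per DSP slice")

    # Relative DSP efficiency, i.e. the kind of "N x" ratio quoted in the comparison.
    ratio = per_dsp["design_A"] / per_dsp["design_B"]
    print(f"design_A is {ratio:.2f}x more DSP-efficient than design_B")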
“…In comparison experiments with other FPGA-based works, Ftrans [5] was built on a VCU118 FPGA via Vivado HLS and achieves 2.94 ms when accelerating a shallow Transformer. FQ-BERT was a fully quantized BERT proposed by [19]. The authors of [8] created an algorithm-hardware co-design for the attention mechanism, which reaches a runtime throughput of 1.87 TOPS on a ZCU102.…”
Section: Performance Comparison
confidence: 99%
“…The work implements a large systolic array (SA) accelerator using a hardware description language (HDL) and evaluates it on a Xilinx FPGA, considering a single Transformer model. In [11], again, a systolic accelerator is implemented at the HDL level. In this hardware/software co-design work, the BERT model is fully quantized to 8 bits and 4 bits to reduce latency and computation.…”
Section: Related Work
confidence: 99%
“…Different from [154], which quantizes the input data, some scholars have devoted themselves to optimizing NLP tasks based on the BERT (Bidirectional Encoder Representations from Transformers) network model [155] and have adopted the idea of full quantization to design an accelerator. Not only the input data but also the weights, activations, softmax, layer normalization, and all the intermediate results are quantized in order to compress the network and improve performance [156].…”
Section: FPGA Accelerator for Natural Language Processing
confidence: 99%
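The citation above stresses that FQ-BERT quantizes the nonlinear operators (softmax, layer normalization) as well as the linear layers. The exact method is not described in this report; one common pattern, sketched below purely as an assumption, is to evaluate the nonlinearity on dequantized inputs and immediately re-quantize its output, so that every tensor crossing a layer boundary stays in integer form. All scales and bit widths here are illustrative.

    import numpy as np

    def softmax(x, axis=-1):
        z = x - x.max(axis=axis, keepdims=True)        # numerically stable softmax
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def quantize_unsigned(x, num_bits=8):
        """Quantize values known to lie in [0, 1] (softmax outputs) to unsigned ints."""
        qmax = 2 ** num_bits - 1
        scale = 1.0 / qmax
        return np.round(x / scale).astype(np.uint8), scale

    # Hypothetical int8 attention scores plus a per-tensor scale, standing in for the
    # output of an integer QK^T matmul inside a quantized attention block.
    scores_q = np.random.default_rng(1).integers(-128, 128, size=(4, 16)).astype(np.int8)
    scores_scale = 0.05                                # assumed scale, illustration only

    probs = softmax(scores_q.astype(np.float32) * scores_scale)   # dequantize, then softmax
    probs_q, probs_scale = quantize_unsigned(probs)                # re-quantize the output
    print(probs_q[0], probs_scale)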