2021 Design, Automation & Test in Europe Conference & Exhibition (DATE)
DOI: 10.23919/date51398.2021.9474043
Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing

Abstract: BERT is the most recent Transformer-based model that achieves state-of-the-art performance in various NLP tasks. In this paper, we investigate the hardware acceleration of BERT on FPGA for edge computing. To tackle the issue of huge computational complexity and memory footprint, we propose to fully quantize the BERT (FQ-BERT), including weights, activations, softmax, layer normalization, and all the intermediate results. Experiments demonstrate that the FQ-BERT can achieve 7.94× compression for weights with neg…
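The abstract describes quantizing every tensor in the model but does not spell out the quantization scheme itself. As a minimal, non-authoritative sketch, the Python snippet below shows symmetric per-tensor fixed-point quantization with hypothetical bit widths (4-bit weights, 8-bit activations); the bit widths, scales, and tensor shapes are illustrative assumptions, not values taken from the paper. The point is that once weights and activations are integers, the dominant matrix multiplications can run in integer arithmetic with a single rescale at the end, which is what makes an FPGA implementation attractive.

    import numpy as np

    def quantize_symmetric(x, num_bits):
        """Symmetric per-tensor quantization to signed integers (illustrative only)."""
        qmax = 2 ** (num_bits - 1) - 1                     # 7 for 4-bit, 127 for 8-bit
        scale = max(float(np.abs(x).max()), 1e-8) / qmax   # per-tensor scale factor
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        """Recover an approximate float tensor from integers and their scale."""
        return q.astype(np.float32) * scale

    # Hypothetical bit widths: 4-bit weights, 8-bit activations.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((768, 768)).astype(np.float32)   # a BERT-sized weight matrix
    a = rng.standard_normal((128, 768)).astype(np.float32)   # activations for 128 tokens

    qw, sw = quantize_symmetric(w, num_bits=4)
    qa, sa = quantize_symmetric(a, num_bits=8)

    # Integer matmul with one rescale at the end, as a fixed-point accelerator would do.
    y_int = qa.astype(np.int32) @ qw.astype(np.int32)
    y = y_int.astype(np.float32) * (sa * sw)

    print("mean abs error vs. float matmul:", float(np.abs(y - a @ w).mean()))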


Cited by 26 publications (6 citation statements) · References 6 publications
“…Further, a more fine-grained mapping scheme and more universal modules allow our DSP slice utilization to be lower than that of other designs. As can be seen from Table 6, our throughput per DSP is more than 3.59× that of [8] and 21.07× that of the FQ-BERT in [19], much better than existing accelerators. This further illustrates the advantages of EFA-Trans in terms of resource utilization.…”
Section: Performance Comparison
confidence: 86%
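The “×” figures in the comparison above come from normalizing throughput by the number of DSP slices consumed. The snippet below shows that normalization with placeholder numbers; the GOPS values and DSP counts are invented for illustration and are not the figures reported by EFA-Trans, [8], or FQ-BERT [19].

    # Throughput-per-DSP normalization used in such comparisons.
    # The numbers below are placeholders, NOT values reported by the cited works.
    accelerators = {
        "design_A": {"throughput_gops": 1000.0, "dsp_slices": 2000},   # hypothetical
        "design_B": {"throughput_gops": 400.0,  "dsp_slices": 3200},   # hypothetical
    }

    per_dsp = {name: s["throughput_gops"] / s["dsp_slices"] for name, s in accelerators.items()}
    for name, value in per_dsp.items():
        print(f"{name}: {value:.3f} GOPS per DSP slice")

    # Relative DSP efficiency, i.e. the kind of "N x" ratio quoted in the comparison.
    ratio = per_dsp["design_A"] / per_dsp["design_B"]
    print(f"design_A is {ratio:.2f}x more DSP-efficient than design_B")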
“…In comparison experiments with other FPGA-based works, Ftrans [5] was built on a VCU118 FPGA via Vivado HLS and achieves 2.94 ms when accelerating a shallow Transformer. FQ-BERT was a fully quantized BERT proposed by [19]. The authors of [8] created an algorithm-hardware co-design for the attention mechanism, which reaches a runtime throughput of 1.87 TOPS on a ZCU102.…”
Section: Performance Comparison
confidence: 99%
“…The work implements a large systolic array (SA) accelerator using a hardware description language (HDL) and evaluates it on a Xilinx FPGA, considering a single Transformer model. In [11], again, a systolic accelerator is implemented at the HDL level. In this hardware/software co-design work, the BERT model is fully quantized to 8 bits and 4 bits to reduce latency and computation.…”
Section: Related Work
confidence: 99%
“…Different from [154], which quantizes the input data, some scholars have devoted themselves to optimizing NLP tasks based on the BERT (Bidirectional Encoder Representations from Transformers) network model [155] and have adopted the idea of full quantization to design an accelerator. Not only the input data but also the weights, activations, softmax, layer normalization, and all the intermediate results are quantized in order to compress the network and improve performance [156].…”
Section: FPGA Accelerator for Natural Language Processing
confidence: 99%
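The citation above stresses that FQ-BERT quantizes the nonlinear operators (softmax, layer normalization) as well as the linear layers. The exact method is not described in this report; one common pattern, sketched below purely as an assumption, is to evaluate the nonlinearity on dequantized inputs and immediately re-quantize its output, so that every tensor crossing a layer boundary stays in integer form. All scales and bit widths here are illustrative.

    import numpy as np

    def softmax(x, axis=-1):
        z = x - x.max(axis=axis, keepdims=True)        # numerically stable softmax
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def quantize_unsigned(x, num_bits=8):
        """Quantize values known to lie in [0, 1] (softmax outputs) to unsigned ints."""
        qmax = 2 ** num_bits - 1
        scale = 1.0 / qmax
        return np.round(x / scale).astype(np.uint8), scale

    # Hypothetical int8 attention scores plus a per-tensor scale, standing in for the
    # output of an integer QK^T matmul inside a quantized attention block.
    scores_q = np.random.default_rng(1).integers(-128, 128, size=(4, 16)).astype(np.int8)
    scores_scale = 0.05                                # assumed scale, illustration only

    probs = softmax(scores_q.astype(np.float32) * scores_scale)   # dequantize, then softmax
    probs_q, probs_scale = quantize_unsigned(probs)                # re-quantize the output
    print(probs_q[0], probs_scale)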