2021
DOI: 10.48550/arXiv.2103.02800
Preprint
Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing

Cited by 3 publications (5 citation statements)
References 6 publications
“…While pruning needs balanced sparsity for high resource usage efficiency, quantization is naturally more friendly to FPGA implementations. The method in (Liu et al., 2021b) employed 8 × 4-bit and 8 × 8-bit quantization on different parts of BERT. VAQF differs from previous work in the following aspects: 1) The quantization process is guided by the compilation step that determines the required activation precision given the target frame rate; 2) The precision for activation quantization is chosen from a wider range to meet a specific real-time frame rate requirement.…”
Section: Transformer Acceleration on FPGAs
confidence: 99%
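The 8 × 4-bit and 8 × 8-bit scheme mentioned in the citation above pairs low-precision weights with 8-bit activations and per-tensor scale factors. Below is a minimal Python/NumPy sketch of that general idea, assuming uniform symmetric per-tensor quantization; the helper `quantize_symmetric` and the toy tensor shapes are illustrative assumptions, not code from the cited work.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, num_bits: int):
    """Uniform symmetric per-tensor quantization (illustrative only).

    Returns an integer tensor q and a real-valued scale such that
    x ~= q * scale, with q clipped to the signed range of `num_bits`.
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

# Toy example: an "8 x 4-bit" linear layer (8-bit activations, 4-bit weights)
# next to an "8 x 8-bit" one, mirroring the precision split in the quote.
rng = np.random.default_rng(0)
act = rng.standard_normal((2, 16)).astype(np.float32)      # activations
w_4bit = rng.standard_normal((16, 16)).astype(np.float32)  # weights -> 4 bits
w_8bit = rng.standard_normal((16, 16)).astype(np.float32)  # weights -> 8 bits

qa, sa = quantize_symmetric(act, 8)
qw4, sw4 = quantize_symmetric(w_4bit, 4)
qw8, sw8 = quantize_symmetric(w_8bit, 8)

# Integer matrix multiplies; the per-tensor scales are folded back in
# afterwards, which is what makes this style of quantization FPGA-friendly.
out_w4a8 = (qa @ qw4) * (sa * sw4)
out_w8a8 = (qa @ qw8) * (sa * sw8)

print("W4A8 max error:", np.max(np.abs(out_w4a8 - act @ w_4bit)))
print("W8A8 max error:", np.max(np.abs(out_w8a8 - act @ w_8bit)))
```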
“…Our accelerators are further compared with previous work, CPU, and GPU with regard to FPS, power, and energy efficiency. Since no study using quantization has been carried out for ViT acceleration on FPGAs, the accelerators for BERT in (Liu et al., 2021b) with 8 × 4-bit and 8 × 8-bit quantization are used for comparison, and other implementation results are all obtained for DeiT-base. As shown in Table 6, our W1A8 and W1A6 accelerators both outperform the BERT design on ZCU102 in terms of FPS, power, and energy efficiency, and the W1A6 design has the highest FPS/W among all implementations.…”
Section: Comparison With Other Implementations
confidence: 99%
“…In many edge devices, computational resources are minimal (Liu, Li, and Cheng 2021). It is not easy to run two models with large parameter counts simultaneously, so the resources of the old model must be released before the new model can be loaded when switching tasks, a process we call redeployment.…”
Section: The Efficiency of Plug-Tagger
confidence: 99%
“…For non-linear functions such as Softmax and LayerNorm, it analyses and designs detailed computational flows and hardware structures. [16] investigates the acceleration of BERT on FPGAs. To reduce the memory footprint, the authors fully quantize all parameters of BERT, including weights, activations, scale factors, Softmax, layer normalization, and other intermediate results.…”
Section: Customized Accelerators
confidence: 99%
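Quantizing the scale factors themselves, as the citation above describes, is what keeps every intermediate value in integer arithmetic. The sketch below illustrates that general idea under the assumption of a dyadic (multiplier, shift) representation of the requantization scale; `dyadic_scale` and `requantize` are hypothetical helpers, not the exact scheme used in [16].

```python
import numpy as np

def dyadic_scale(scale: float, shift: int = 16):
    """Approximate a real requantization scale as multiplier / 2**shift.

    Storing scales this way means the accelerator never needs floating
    point: rescaling an int32 accumulator takes one integer multiply and
    one shift. Illustrative sketch only, not the design in [16].
    """
    return int(round(scale * (1 << shift))), shift

def requantize(acc: np.ndarray, multiplier: int, shift: int) -> np.ndarray:
    """Rescale int32 accumulators to int8 with integer-only arithmetic."""
    rounded = (acc.astype(np.int64) * multiplier + (1 << (shift - 1))) >> shift
    return np.clip(rounded, -128, 127).astype(np.int8)

# Toy usage: pretend `acc` is the int32 output of an int8 x int8 matmul and
# 0.0123 is the combined activation/weight scale for this layer.
acc = np.array([[1200, -4500, 9001]], dtype=np.int32)
mult, sh = dyadic_scale(0.0123)
print(requantize(acc, mult, sh))            # integer-only path
print(np.round(acc * 0.0123).astype(int))   # floating-point reference
```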
“…In comparison experiments with other works based on reconfigurable devices, to our knowledge, no researcher has applied Transformer-based visual models to the field of reconfigurable computation, so this paper compares our design with other Transformer models in NLP. FQ-BERT is a fully quantized BERT proposed by [16]. The 8-bit NPE of NVU-1024 [23] is an overlay processor for BERT-based models.…”
Section: Comparison With CPU, GPU, and FPGA
confidence: 99%