Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.433

Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation

Abstract: The deployment of the widely used Transformer architecture is challenging because of heavy computation load and memory overhead during inference, especially when the target device is limited in computational resources such as mobile or edge devices. Quantization is an effective technique to address such challenges. Our analysis shows that for a given number of quantization bits, each block of Transformer contributes to translation quality and inference computations in different manners. Moreover, even inside an em…
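
The abstract's block-wise observation lends itself to mixed precision, where each Transformer block gets its own bit-width. Below is a minimal, hypothetical NumPy sketch of that idea using plain symmetric uniform quantization; the block names, bit assignments, and the quantize_uniform helper are illustrative assumptions, not the paper's actual method (the excerpts below suggest the paper builds on binary-coding quantization).

```python
# Hypothetical sketch: per-block bit-widths with plain symmetric uniform
# quantization. Block names, bit assignments, and quantize_uniform are
# illustrative, NOT the paper's method or configuration.
import numpy as np

BITS_PER_BLOCK = {      # assumed, for illustration only
    "embedding": 2,
    "encoder_attention": 3,
    "decoder_attention": 3,
    "feed_forward": 4,
}

def quantize_uniform(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantize w to a symmetric 2**bits-level grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 3 for 3 bits
    scale = np.abs(w).max() / qmax + 1e-12       # per-tensor scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # float approximation of w

weights = {name: np.random.randn(256, 256) for name in BITS_PER_BLOCK}
quantized = {name: quantize_uniform(w, BITS_PER_BLOCK[name])
             for name, w in weights.items()}
```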

Cited by 16 publications (13 citation statements): 0 supporting, 13 mentioning, 0 contrasting.
References 20 publications.
“…For the Transformers and their variants, the processing time of matrix multiplications dominates the entire inference latency because of higher time complexity compared to activation functions, normalization layers, and so on [27], [28], [30]. To validate such a claim, Fig.…”
Section: B. GPU-Accelerated Generative LMs (mentioning)
confidence: 99%
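
As a quick sanity check of the excerpt above, the following self-contained sketch times one weight matmul against a LayerNorm-style pass over the same activations. The shapes and the bench helper are illustrative assumptions, and absolute numbers will vary by machine.

```python
# Timing sketch for the excerpt above: one weight matmul vs. a
# LayerNorm-style pass over the same activations. Shapes and the
# bench() helper are illustrative; absolute numbers vary by machine.
import time
import numpy as np

seq, d_model, d_ff = 512, 1024, 4096
x = np.random.randn(seq, d_model).astype(np.float32)
w = np.random.randn(d_model, d_ff).astype(np.float32)

def bench(fn, reps=20):
    fn()                                     # warm-up run
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

t_matmul = bench(lambda: x @ w)              # O(seq * d_model * d_ff)
t_norm = bench(lambda: (x - x.mean(-1, keepdims=True))
                       / (x.std(-1, keepdims=True) + 1e-5))  # O(seq * d_model)

print(f"matmul: {t_matmul * 1e3:.2f} ms | layernorm-like: {t_norm * 1e3:.2f} ms")
```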
“…which does not have analytical solutions except when q = 1. Thus, scaling factors and binary vectors are obtained by iterative search methods [20], [34] or by quantization-aware training [30]. In this work, we discuss the unique mathematical properties of BCQ to enable efficient quantized matrix multiplications.…”
Section: Binary-Coding Quantization (mentioning)
confidence: 99%
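
The excerpt above refers to binary-coding quantization (BCQ), which approximates a weight vector w as a sum of q scale/binary-vector pairs, w ≈ Σᵢ αᵢbᵢ with bᵢ ∈ {−1, +1}ⁿ. Below is a minimal sketch of the greedy residual-fitting heuristic, one common instance of the iterative search methods mentioned; bcq_greedy is an illustrative name, and for q = 1 it reduces to the analytic optimum α = mean(|w|), b = sign(w).

```python
# Sketch of the greedy residual-fitting heuristic for BCQ mentioned
# above: fit q (scale, binary vector) pairs to a weight vector w.
# bcq_greedy is an illustrative name; for q = 1 it reduces to the
# analytic optimum alpha = mean(|w|), b = sign(w).
import numpy as np

def bcq_greedy(w: np.ndarray, q: int):
    """Approximate w as sum_i alphas[i] * binaries[i], binaries in {-1,+1}."""
    residual = w.astype(np.float64).copy()
    alphas, binaries = [], []
    for _ in range(q):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign of current residual
        alpha = np.abs(residual).mean()          # optimal scale for this b
        residual -= alpha * b                    # leave the rest for the next pair
        alphas.append(alpha)
        binaries.append(b)
    return np.array(alphas), np.stack(binaries)

w = np.random.randn(4096)
alphas, bs = bcq_greedy(w, q=3)
approx = (alphas[:, None] * bs).sum(axis=0)
print("relative error:", np.linalg.norm(w - approx) / np.linalg.norm(w))
```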
“…quantization or faster search algorithms. Quantization can be used to speed up inference and relax hardware requirements, as has been shown for e.g., 8-bit (Quinn and Ballesteros, 2018), 4-bit (Aji and Heafield, 2020), and recently also below 3-bit quantization (Chung et al., 2020) of NMT models. In the wider NLP space, there has been interest in evaluating the trade-offs of different compression techniques for downstream finetuning.…”
Section: A.1 Overview: Compression (mentioning)
confidence: 99%
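
For the 8-bit case cited above, post-training quantization can be sketched as storing each weight matrix as int8 plus one float scale. The helpers below are illustrative assumptions, not any cited paper's implementation; the sketch shows only the 4× memory saving, since real latency gains need int8 compute kernels.

```python
# Illustrative int8 post-training quantization: store weights as int8
# plus one float scale, dequantize on the fly. Helper names are
# assumptions; this shows the 4x memory saving, not a kernel speedup.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0 + 1e-12      # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int8(w)
print("fp32 bytes:", w.nbytes, "| int8 bytes:", q.nbytes)        # 4x smaller
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```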
“…Experiments on the WMT14 En-De, WMT14 En-Fr and NIST12 Zh-En machine translation (MT) benchmarks demonstrate that the improved system achieves a 3.80× speedup on CPU and a 2.52× speedup on GPU with performance on par with the baseline. The speedup obtained is available on most modern hardware, as it does not depend on specific hardware or libraries; e.g., quantization (Chung et al., 2020) and unstructured pruning (Hoefler et al., 2021) require the support of the latest hardware-dependent acceleration libraries.…”
Section: Introduction (mentioning)
confidence: 99%