Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.433

Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation

Abstract: The deployment of the widely used Transformer architecture is challenging because of heavy computation load and memory overhead during inference, especially when the target device is limited in computational resources such as mobile or edge devices. Quantization is an effective technique to address such challenges. Our analysis shows that for a given number of quantization bits, each block of Transformer contributes to translation quality and inference computations in different manners. Moreover, even inside an em…
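
The abstract's block-wise observation lends itself to mixed precision, where each Transformer block gets its own bit-width. Below is a minimal, hypothetical NumPy sketch of that idea using plain symmetric uniform quantization; the block names, bit assignments, and the quantize_uniform helper are illustrative assumptions, not the paper's actual method (the excerpts below suggest the paper builds on binary-coding quantization).

```python
# Hypothetical sketch: per-block bit-widths with plain symmetric uniform
# quantization. Block names, bit assignments, and quantize_uniform are
# illustrative, NOT the paper's method or configuration.
import numpy as np

BITS_PER_BLOCK = {      # assumed, for illustration only
    "embedding": 2,
    "encoder_attention": 3,
    "decoder_attention": 3,
    "feed_forward": 4,
}

def quantize_uniform(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantize w to a symmetric 2**bits-level grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 3 for 3 bits
    scale = np.abs(w).max() / qmax + 1e-12       # per-tensor scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # float approximation of w

weights = {name: np.random.randn(256, 256) for name in BITS_PER_BLOCK}
quantized = {name: quantize_uniform(w, BITS_PER_BLOCK[name])
             for name, w in weights.items()}
```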

Cited by 16 publications (13 citation statements): 0 supporting, 13 mentioning, 0 contrasting.
References 20 publications.
“…For the Transformers and their variants, the processing time of matrix multiplications dominates the entire inference latency because of higher time complexity compared to activation functions, normalization layers, and so on [27], [28], [30]. To validate such a claim, Fig.…”
Section: B. GPU-Accelerated Generative LMs (mentioning)
confidence: 99%
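
As a quick sanity check of the excerpt above, the following self-contained sketch times one weight matmul against a LayerNorm-style pass over the same activations. The shapes and the bench helper are illustrative assumptions, and absolute numbers will vary by machine.

```python
# Timing sketch for the excerpt above: one weight matmul vs. a
# LayerNorm-style pass over the same activations. Shapes and the
# bench() helper are illustrative; absolute numbers vary by machine.
import time
import numpy as np

seq, d_model, d_ff = 512, 1024, 4096
x = np.random.randn(seq, d_model).astype(np.float32)
w = np.random.randn(d_model, d_ff).astype(np.float32)

def bench(fn, reps=20):
    fn()                                     # warm-up run
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

t_matmul = bench(lambda: x @ w)              # O(seq * d_model * d_ff)
t_norm = bench(lambda: (x - x.mean(-1, keepdims=True))
                       / (x.std(-1, keepdims=True) + 1e-5))  # O(seq * d_model)

print(f"matmul: {t_matmul * 1e3:.2f} ms | layernorm-like: {t_norm * 1e3:.2f} ms")
```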
“…which does not have analytical solutions except when q = 1. Thus, scaling factors and binary vectors are obtained by iterative search methods [20], [34] or by quantization-aware training [30]. In this work, we discuss the unique mathematical properties of BCQ to enable efficient quantized matrix multiplications.…”
Section: Binary-Coding Quantization (mentioning)
confidence: 99%
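
The excerpt above refers to binary-coding quantization (BCQ), which approximates a weight vector w as a sum of q scale/binary-vector pairs, w ≈ Σᵢ αᵢbᵢ with bᵢ ∈ {−1, +1}ⁿ. Below is a minimal sketch of the greedy residual-fitting heuristic, one common instance of the iterative search methods mentioned; bcq_greedy is an illustrative name, and for q = 1 it reduces to the analytic optimum α = mean(|w|), b = sign(w).

```python
# Sketch of the greedy residual-fitting heuristic for BCQ mentioned
# above: fit q (scale, binary vector) pairs to a weight vector w.
# bcq_greedy is an illustrative name; for q = 1 it reduces to the
# analytic optimum alpha = mean(|w|), b = sign(w).
import numpy as np

def bcq_greedy(w: np.ndarray, q: int):
    """Approximate w as sum_i alphas[i] * binaries[i], binaries in {-1,+1}."""
    residual = w.astype(np.float64).copy()
    alphas, binaries = [], []
    for _ in range(q):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign of current residual
        alpha = np.abs(residual).mean()          # optimal scale for this b
        residual -= alpha * b                    # leave the rest for the next pair
        alphas.append(alpha)
        binaries.append(b)
    return np.array(alphas), np.stack(binaries)

w = np.random.randn(4096)
alphas, bs = bcq_greedy(w, q=3)
approx = (alphas[:, None] * bs).sum(axis=0)
print("relative error:", np.linalg.norm(w - approx) / np.linalg.norm(w))
```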
“…quantization or faster search algorithms. Quantization can be used to speed up inference and relax hardware requirements, as has been shown for e.g., 8-bit (Quinn and Ballesteros, 2018), 4-bit (Aji and Heafield, 2020), and recently also below 3-bit quantization (Chung et al., 2020) of NMT models. In the wider NLP space, there has been interest in evaluating the trade-offs of different compression techniques for downstream finetuning.…”
Section: A.1 Overview: Compression (mentioning)
confidence: 99%
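
For the 8-bit case cited above, post-training quantization can be sketched as storing each weight matrix as int8 plus one float scale. The helpers below are illustrative assumptions, not any cited paper's implementation; the sketch shows only the 4× memory saving, since real latency gains need int8 compute kernels.

```python
# Illustrative int8 post-training quantization: store weights as int8
# plus one float scale, dequantize on the fly. Helper names are
# assumptions; this shows the 4x memory saving, not a kernel speedup.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0 + 1e-12      # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int8(w)
print("fp32 bytes:", w.nbytes, "| int8 bytes:", q.nbytes)        # 4x smaller
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```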
“…Experiments on the WMT14 En-De, WMT14 En-Fr and NIST12 Zh-En machine translation (MT) benchmarks demonstrate that the improved system achieves a 3.80× speedup on CPU and a 2.52× speedup on GPU with performance on par with the baseline. The speedup obtained is available on most modern hardware, as it does not depend on specific hardware or libraries; e.g., quantization (Chung et al., 2020) and unstructured pruning (Hoefler et al., 2021) require the support of the latest hardware-dependent acceleration libraries.…”
Section: Introduction (mentioning)
confidence: 99%