2020
DOI: 10.1109/taslp.2020.3015659
Neural Network Language Model Compression With Product Quantization and Soft Binarization

Cited by 10 publications (11 citation statements)
References 24 publications
“…First, to the best of our knowledge, this paper is the first work to apply mixed precision quantization methods to Transformer language models. In contrast, previous research on low-bit quantization focused on convolutional neural networks (CNNs) [22] and LSTM-RNN LMs [23], where expert-designed, partially quantized linear layers containing binary weight matrices, full-precision biases, and additional scaling parameters were used to mitigate the performance degradation due to uniform precision quantization.…”
Section: Introduction
confidence: 99%
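The partially quantized layer described in this statement can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the cited papers' exact implementation: binary {-1, +1} weights with a full-precision scaling factor and a full-precision bias, trained via a straight-through estimator. The class name and the choice of scaling factor (mean absolute weight, XNOR-Net-style) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BinarizedLinear(nn.Module):
    """Illustrative partially quantized linear layer: binary weights,
    full-precision per-layer scale, full-precision bias."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Latent full-precision weights updated during training.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))  # kept full precision

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scaling factor alpha = mean |W| (one assumption among several
        # possible; the cited works use their own scaling schemes).
        alpha = self.weight.abs().mean()
        w_bin = torch.sign(self.weight)
        # Straight-through estimator: forward pass uses the scaled binary
        # weights, backward pass flows to the latent full-precision weights.
        w_q = self.weight + (alpha * w_bin - self.weight).detach()
        return nn.functional.linear(x, w_q, self.bias)

layer = BinarizedLinear(512, 512)
y = layer(torch.randn(8, 512))  # usable as a drop-in nn.Linear replacement
```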
“…Another powerful family of techniques recently drawing increasing interest across the machine learning, computer vision, and speech technology communities as a solution to this problem is low-bit DNN quantization [31]-[37], [52], [57], [58], [62], [74], [75]. By replacing floating-point DNN parameters with low-precision values, for example binary numbers, model sizes can be dramatically reduced without changing the DNN architecture [32], [57], [73].…”
Section: Introduction
confidence: 99%
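As a back-of-the-envelope check of the size reduction this passage describes, the sketch below (an illustration, not code from the cited works) packs the sign bits of a float32 weight matrix and compares storage: 1 bit per parameter versus 32, roughly a 32x reduction before any per-layer scaling factors are counted.

```python
import numpy as np

def packed_binary_size_bytes(weights: np.ndarray) -> int:
    """Bytes needed to store sign(weights) at 1 bit per parameter."""
    bits = (weights.ravel() >= 0).astype(np.uint8)
    return np.packbits(bits).nbytes

w = np.random.randn(1024, 1024).astype(np.float32)
full_size = w.nbytes                       # 4 bytes per parameter
binary_size = packed_binary_size_bytes(w)  # 1 bit per parameter
print(full_size / binary_size)             # ~32x smaller, same architecture
```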
“…Further DNN size reduction can be obtained when low-precision quantization is used in combination with neural architecture search (NAS) techniques, for example in the SqueezeNet system designed for computer vision tasks [52]. In contrast to the extensive prior research on low-bit quantization methods, which primarily targets computer vision tasks [31]-[37], [52], only limited previous research in this direction has been conducted in the context of language modelling [57], [58] and ASR systems [56], [59].…”
Section: Introduction
confidence: 99%
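To make the quantization-plus-search idea concrete, here is a toy sketch, not the SqueezeNet procedure and with hypothetical layer sizes: it exhaustively enumerates per-layer bit-widths and ranks the resulting model sizes, i.e. the kind of search space a NAS-style method would explore (a real method would also score accuracy, not just size).

```python
from itertools import product

# Hypothetical parameter counts per layer; purely illustrative.
layer_params = {"embedding": 10_000_000, "recurrent": 4_000_000, "output": 2_000_000}
candidate_bits = (1, 2, 4, 8)  # precisions the search may assign per layer

def size_megabytes(assignment: dict) -> float:
    """Total weight-storage cost for one per-layer bit-width assignment."""
    total_bits = sum(layer_params[name] * b for name, b in assignment.items())
    return total_bits / 8 / 1e6

# Enumerate every assignment and report the three smallest models.
assignments = [dict(zip(layer_params, bits))
               for bits in product(candidate_bits, repeat=len(layer_params))]
for a in sorted(assignments, key=size_megabytes)[:3]:
    print(f"{size_megabytes(a):8.2f} MB  {a}")
```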