ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414076
Mixed Precision Quantization of Transformer Language Models for Speech Recognition

Abstract: State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications. Low-bit deep neural network quantization techniques provide a powerful solution to dramatically reduce their model size. Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors. To this end, novel mixed precision DNN quantization methods are …

Cited by 8 publications (7 citation statements)
References 16 publications
“…Prior research for Transformer based speech processing models has largely evolved into two categories: 1) architecture compression methods that aim to minimize the Transformer model structural redundancy measured by their depth, width, sparsity, or their combinations using techniques such as pruning [8][9][10], low-rank matrix factorization [11,12] and distillation [13,14]; and 2) low-bit quantization approaches that use either uniform [15][16][17][18], or mixed precision [12,19] settings. A combination of both architecture compression and low-bit quantization approaches has also been studied to produce larger model compression ratios [12].…”
Section: Introduction (mentioning, confidence: 99%)
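To make the two quantization settings contrasted in the excerpt above concrete, the following is a minimal sketch, not the paper's actual method: symmetric uniform quantization applied with one shared bit-width to every layer, versus per-layer mixed precision bit-widths. The layer names and the example bit-width assignment are hypothetical; in practice the per-layer bit-widths would be chosen from each layer's measured sensitivity to quantization error.

```python
import numpy as np

def uniform_quantize(w: np.ndarray, n_bits: int) -> np.ndarray:
    """Symmetric uniform (fake) quantization of a weight tensor to n_bits."""
    q_max = 2 ** (n_bits - 1) - 1             # e.g. 7 for signed 4-bit
    scale = np.max(np.abs(w)) / q_max         # one scale per tensor
    w_q = np.clip(np.round(w / scale), -q_max - 1, q_max)
    return w_q * scale                         # de-quantized weights for evaluation

# Hypothetical layers of a Transformer LM.
layers = {"attn_qkv": np.random.randn(256, 256),
          "ffn_in":   np.random.randn(256, 1024)}

# Uniform precision: every layer shares the same bit-width.
uniform = {name: uniform_quantize(w, 4) for name, w in layers.items()}

# Mixed precision: bit-widths differ per layer (illustrative values only).
bit_widths = {"attn_qkv": 8, "ffn_in": 2}
mixed = {name: uniform_quantize(w, bit_widths[name]) for name, w in layers.items()}
```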
“…Deep learning (DL) technology was proposed by Hinton in 2006 [2] and has been widely used in computer vision [3], speech recognition [4], natural language processing [5] and other fields. Because the deep learning-based method can adaptively extract fault features from a large quantity of signal data for target tasks and has advantages in model construction and generalization performance [6], it has become a research hotspot for experts and scholars to study bearing fault diagnosis.…”
Section: Introduction (mentioning, confidence: 99%)
“…With the rapid progress of deep neural network (DNN) based ASR technologies in recent decades, the underlying model architectures of NNLMs have evolved from feedforward structures [3]- [6] to more advanced variants represented by long-short term memory recurrent neural networks (LSTM-RNNs) [7]- [10], [18] and recently neural Transformers [11]- [14], [19] that are designed to model longer range contexts. In particular, Transformer based NNLMs in recent years have defined state-of-the-art performance across a range of ASR tasks [11]- [14], [20]. These models [11]- [13], [20] are often constructed using a deep stacking of multiple self-attention based neural building blocks [21]- [23], each of which also includes residual connections [24] and layer normalization modules [25].…”
Section: Introduction (mentioning, confidence: 99%)
“…In particular, Transformer based NNLMs in recent years have defined state-of-the-art performance across a range of ASR tasks [11]- [14], [20]. These models [11]- [13], [20] are often constructed using a deep stacking of multiple self-attention based neural building blocks [21]- [23], each of which also includes residual connections [24] and layer normalization modules [25]. Additional positional encoding layers [19], [26] are used to augment the self-attention modules with word sequence order information.…”
Section: Introduction (mentioning, confidence: 99%)
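The two excerpts above describe Transformer NNLMs as deep stacks of self-attention building blocks with residual connections, layer normalization, and positional encoding. The sketch below, assuming a PyTorch implementation, shows one such block and a sinusoidal positional encoding; the model dimensions, stacking depth, and class names are illustrative and not taken from the cited papers.

```python
import math
import torch
import torch.nn as nn

class TransformerLMBlock(nn.Module):
    """One self-attention building block: attention + feed-forward,
    each wrapped with a residual connection and layer normalization."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)   # self-attention
        x = self.norm1(x + a)                            # residual + layer norm
        x = self.norm2(x + self.ffn(x))                  # residual + layer norm
        return x

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding injecting word-order information."""
    pos = torch.arange(seq_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Usage: a deep stack of such blocks forms the Transformer LM body.
x = torch.randn(2, 16, 512) + positional_encoding(16, 512)
blocks = nn.Sequential(*[TransformerLMBlock() for _ in range(6)])
out = blocks(x)
```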