ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414076
Mixed Precision Quantization of Transformer Language Models for Speech Recognition

Abstract: State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications. Low-bit deep neural network quantization techniques provide a powerful solution to dramatically reduce their model size. Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors. To this end, novel mixed precision DNN quantization methods are …

Cited by 8 publications (7 citation statements)
References 16 publications
“…Prior research for Transformer based speech processing models has largely evolved into two categories: 1) architecture compression methods that aim to minimize the Transformer model structural redundancy measured by their depth, width, sparsity, or their combinations using techniques such as pruning [8][9][10], low-rank matrix factorization [11,12] and distillation [13,14]; and 2) low-bit quantization approaches that use either uniform [15][16][17][18], or mixed precision [12,19] settings. A combination of both architecture compression and low-bit quantization approaches has also been studied to produce larger model compression ratios [12].…”
Section: Introduction (mentioning, confidence: 99%)
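To make the two quantization settings contrasted in the excerpt above concrete, the following is a minimal sketch, not the paper's actual method: symmetric uniform quantization applied with one shared bit-width to every layer, versus per-layer mixed precision bit-widths. The layer names and the example bit-width assignment are hypothetical; in practice the per-layer bit-widths would be chosen from each layer's measured sensitivity to quantization error.

```python
import numpy as np

def uniform_quantize(w: np.ndarray, n_bits: int) -> np.ndarray:
    """Symmetric uniform (fake) quantization of a weight tensor to n_bits."""
    q_max = 2 ** (n_bits - 1) - 1             # e.g. 7 for signed 4-bit
    scale = np.max(np.abs(w)) / q_max         # one scale per tensor
    w_q = np.clip(np.round(w / scale), -q_max - 1, q_max)
    return w_q * scale                         # de-quantized weights for evaluation

# Hypothetical layers of a Transformer LM.
layers = {"attn_qkv": np.random.randn(256, 256),
          "ffn_in":   np.random.randn(256, 1024)}

# Uniform precision: every layer shares the same bit-width.
uniform = {name: uniform_quantize(w, 4) for name, w in layers.items()}

# Mixed precision: bit-widths differ per layer (illustrative values only).
bit_widths = {"attn_qkv": 8, "ffn_in": 2}
mixed = {name: uniform_quantize(w, bit_widths[name]) for name, w in layers.items()}
```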
“…Deep learning (DL) technology was proposed by Hinton in 2006 [2] and has been widely used in computer vision [3], speech recognition [4], natural language processing [5] and other fields. Because the deep learning-based method can adaptively extract fault features from a large quantity of signal data for target tasks and has advantages in model construction and generalization performance [6], it has become a research hotspot for experts and scholars to study bearing fault diagnosis.…”
Section: Introduction (mentioning, confidence: 99%)
“…With the rapid progress of deep neural network (DNN) based ASR technologies in recent decades, the underlying model architectures of NNLMs have evolved from feedforward structures [3]- [6] to more advanced variants represented by long-short term memory recurrent neural networks (LSTM-RNNs) [7]- [10], [18] and recently neural Transformers [11]- [14], [19] that are designed to model longer range contexts. In particular, Transformer based NNLMs in recent years have defined state-of-the-art performance across a range of ASR tasks [11]- [14], [20]. These models [11]- [13], [20] are often constructed using a deep stacking of multiple self-attention based neural building blocks [21]- [23], each of which also includes residual connections [24] and layer normalization modules [25].…”
Section: Introduction (mentioning, confidence: 99%)
“…In particular, Transformer based NNLMs in recent years have defined state-of-the-art performance across a range of ASR tasks [11]- [14], [20]. These models [11]- [13], [20] are often constructed using a deep stacking of multiple self-attention based neural building blocks [21]- [23], each of which also includes residual connections [24] and layer normalization modules [25]. Additional positional encoding layers [19], [26] are used to augment the self-attention modules with word sequence order information.…”
Section: Introduction (mentioning, confidence: 99%)
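The two excerpts above describe Transformer NNLMs as deep stacks of self-attention building blocks with residual connections, layer normalization, and positional encoding. The sketch below, assuming a PyTorch implementation, shows one such block and a sinusoidal positional encoding; the model dimensions, stacking depth, and class names are illustrative and not taken from the cited papers.

```python
import math
import torch
import torch.nn as nn

class TransformerLMBlock(nn.Module):
    """One self-attention building block: attention + feed-forward,
    each wrapped with a residual connection and layer normalization."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)   # self-attention
        x = self.norm1(x + a)                            # residual + layer norm
        x = self.norm2(x + self.ffn(x))                  # residual + layer norm
        return x

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding injecting word-order information."""
    pos = torch.arange(seq_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Usage: a deep stack of such blocks forms the Transformer LM body.
x = torch.randn(2, 16, 512) + positional_encoding(16, 512)
blocks = nn.Sequential(*[TransformerLMBlock() for _ in range(6)])
out = blocks(x)
```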