2021
DOI: 10.48550/arxiv.2104.03006
Preprint

Librispeech Transducer Model with Internal Language Model Prior Correction



Cited by 2 publications (4 citation statements)
References 0 publications
“…Considering the variability of different pronunciations, the window size should differ according to what each query vector attends to; thus, a parameterized window-size calculation is used rather than a fixed window size. The calculation method is shown in Equation (9).…”
Section: $l_i = I \cdot \sigma(\mathbf{U}^T g(\mathbf{W}(\mathbf{E}_{x_i} + \mathbf{u} + \mathbf{v})))$ (mentioning)
confidence: 99%
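The quoted Eq. (9) computes a separate attention-window size for each query position. Below is a minimal NumPy sketch of that computation; the shapes and names are assumptions not given in the quote: g is taken as tanh, E_xi stands for the embedding of input x_i, and u and v are the two auxiliary vectors added to it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def window_size(E_xi, u, v, W, U, I_max, g=np.tanh):
    """Per-query attention window size, following the quoted Eq. (9):
    l_i = I * sigmoid(U^T g(W(E_{x_i} + u + v))).

    Assumed shapes (not specified in the quote):
    E_xi, u, v : (d,)   embedding of x_i plus two auxiliary vectors
    W          : (h, d) projection matrix
    U          : (h,)   scoring vector
    I_max      : scalar upper bound I on the window size
    """
    h = g(W @ (E_xi + u + v))      # project the summed vector, apply g
    return I_max * sigmoid(U @ h)  # sigmoid in (0, 1), scaled up to I_max
```

Because the sigmoid is bounded, the predicted window size varies smoothly between 0 and I_max, which is the point of parameterizing it per query rather than fixing it.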
“…LAS+SpecAugment [27] is a sequence-to-sequence framework trained with SpecAugment. The LSTM Transducer [9] improves decoding with an external language model and an estimated internal LM. The hybrid model with Transformer rescoring [30] leverages the Transformer to improve hybrid acoustic modeling.…”
Section: Comparison Experiments (mentioning)
confidence: 99%
“…In shallow fusion [7,8], a log-linear interpolation between the E2E model score and the LM score is computed at each step of the beam search. To improve shallow fusion, internal LM estimation-based fusion [9,10,11,12,13,14] was proposed to estimate an internal LM (ILM) score and subtract it from the shallow fusion score. However, all these methods require an external LM during inference, increasing decoding time and computational cost.…”
Section: Introduction (mentioning)
confidence: 99%
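For concreteness, the scoring rule described in the quote fits in one line per beam-search step. The sketch below is illustrative only: the weight names lam_lm and lam_ilm and their values are assumptions, not taken from the cited papers; plain shallow fusion corresponds to lam_ilm = 0.

```python
def fusion_score(log_p_e2e, log_p_lm, log_p_ilm=0.0,
                 lam_lm=0.6, lam_ilm=0.3):
    """Log-linear score of one candidate token at one beam-search step.

    Shallow fusion [7, 8]:        log_p_e2e + lam_lm * log_p_lm
    ILM-estimation fusion [9-14]: additionally subtract the estimated
    internal-LM score, i.e.       ... - lam_ilm * log_p_ilm
    The interpolation weights here are illustrative placeholders.
    """
    return log_p_e2e + lam_lm * log_p_lm - lam_ilm * log_p_ilm
```

This also makes the quoted cost argument concrete: the external LM (and, for ILM-based fusion, the internal-LM estimate) must be evaluated for every candidate token at every beam-search step, which is what increases decoding time.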