2019
DOI: 10.48550/arxiv.1907.01030
Preprint

LSTM Language Models for LVCSR in First-Pass Decoding and Lattice-Rescoring

Abstract: LSTM-based language models are an important part of modern LVCSR systems, as they significantly improve performance over traditional backoff language models. Incorporating them efficiently into decoding has been notoriously difficult. In this paper we present an approach based on a combination of one-pass decoding and lattice rescoring. We perform decoding with the LSTM-LM in the first pass but recombine hypotheses that share the last two words; afterwards we rescore the resulting lattice. We run our systems on …
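The recombination step mentioned in the abstract can be illustrated with a minimal sketch: during first-pass beam search with the LSTM-LM, hypotheses that end in the same two words are merged, keeping only the best-scoring one (the pruned alternatives would normally survive as lattice arcs for the later rescoring pass). The names below (Hypothesis, recombine) are illustrative and not taken from the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: tuple             # word sequence decoded so far
    score: float             # combined acoustic + LM score (higher is better)
    lm_state: object = None  # LSTM hidden state carried by the surviving hypothesis

def recombine(hyps):
    """Keep the best hypothesis per (last two words) key.

    In a real decoder the dropped alternatives would be kept as lattice arcs
    so that the second-pass rescoring can still reach them; here they are
    simply discarded to keep the sketch short.
    """
    best = {}
    for hyp in hyps:
        key = hyp.words[-2:]  # recombination key: the last two words
        if key not in best or hyp.score > best[key].score:
            best[key] = hyp
    return list(best.values())

# Usage example: the first two hypotheses share the last two words and are merged.
hyps = [
    Hypothesis(("the", "cat", "sat"), -12.3),
    Hypothesis(("a", "cat", "sat"), -12.9),
    Hypothesis(("the", "cat", "slept"), -13.1),
]
print([h.words for h in recombine(hyps)])
# [('the', 'cat', 'sat'), ('the', 'cat', 'slept')]
```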

Cited by 10 publications (22 citation statements) | References 24 publications

“…The optimal tradeoff between number of epochs in combination with multi-stage phonetic training has still not been explored and is left for future work. For recognition we use 4-gram [33] and LSTM language models [34]. We also include a second pass rescoring with a Transformer (Trafo) LM for one of our experiments [35].…”
Section: Experimental Setting (mentioning, confidence: 99%)
“…Neural network LMs are shown to bring consistent improvements over count-based LMs [14,1,2]. These neural LMs are then used either in second-pass lattice rescoring or first pass decoding for ASR [3,4,5,15,16]. To mitigate the problem of having to traverse over the full vocabulary in the softmax normalization, various sampling-based training criteria are proposed and investigated [6,7,8,9,10,11,12].…”
Section: Related Work (mentioning, confidence: 99%)
“…Nowadays, word-based neural language models (LMs) consistently give better perplexities than count-based language models [1,2], and are commonly used for second-pass rescoring or first-pass decoding of automatic speech recognition (ASR) outputs [3,4,5]. One challenge to train such LMs, especially when the vocabulary size is large, is the traversal over the full vocabulary in the softmax normalization.…”
Section: Introduction (mentioning, confidence: 99%)
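The normalization cost mentioned in the excerpt above comes from the standard word-level softmax, which sums over the entire vocabulary V for every prediction; the notation below is generic and not taken from the cited papers:

$$
P(w \mid h) = \frac{\exp(z_w(h))}{\sum_{w' \in V} \exp(z_{w'}(h))}
$$

so each training step costs O(|V|) in the output layer, which is what the sampling-based training criteria referenced above aim to avoid.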
“…Neural LMs are commonly used in second-pass rescoring [1,2,22,23] or first-pass decoding [3] in ASR systems. While for conventional research-oriented datasets like Switchboard the word-level vocabulary size is several dozens of thousands, for larger systems, especially commercially available systems, the vocabulary size can often go up to several hundred thousand.…”
Section: Related Work (mentioning, confidence: 99%)
“…Enjoying the benefit of large amounts of text-only training data, language models (LMs) remain an important part of the modern automatic speech recognition (ASR) pipeline [1,2,3]. However, the large quantity of available data is a double-edged sword, posing real challenges in training.…”
Section: Introduction (mentioning, confidence: 99%)