2021
DOI: 10.48550/arxiv.2110.09245
Preprint

Efficient Sequence Training of Attention Models using Approximative Recombination

Cited by 3 publications (6 citation statements) | References 0 publications

“…However, once external language models are included in the training phase, sequence normalization needs to be included explicitly, leading to MMI sequence discriminative training. This has been exploited as a further approach to combine E2E models with external language models trained on text-only data during the training phase itself [128], [129], [130].…”
Section: B. Training With External Language Models (mentioning)
confidence: 99%
“…However, once external language models are included in the training phase, sequence normalization needs to be included explicitly, leading to MMI sequence discriminative training. This has been exploited as a further approach to combine E2E models with external language models trained on text-only data already in the training phase [98], [99], [100].…”
Section: B. Training With External Language Models (mentioning)
confidence: 99%
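
For reference, the MMI sequence-discriminative criterion with an external language model that the statements above refer to is commonly written as below. This is a generic textbook form; the notation (end-to-end model P_AM, external language model P_LM, LM scale λ) is ours for illustration rather than taken from the cited works:

\[
\mathcal{L}_{\text{MMI}} = -\log \frac{P_{\text{AM}}(Y^{*} \mid X)\, P_{\text{LM}}(Y^{*})^{\lambda}}{\sum_{Y} P_{\text{AM}}(Y \mid X)\, P_{\text{LM}}(Y)^{\lambda}}
\]

The denominator provides the explicit sequence-level normalization mentioned in the quotes; in practice the sum over competing hypotheses Y is approximated with an N-best list or a lattice.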
“…Finally, Optimal Completion Distillation (OCD) [112] seeks to minimize the total edit distance using an efficient dynamic programming algorithm. Another body of research on sequence training introduces a separate external language model at training time [113], which can also be done efficiently via approximate lattice recombination [99] and lattice-free approaches [100].…”
Section: Minimum Error Training (mentioning)
confidence: 99%
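
The "total edit distance" that OCD minimizes is the standard Levenshtein distance between hypothesis and reference token sequences. A minimal Python sketch of its dynamic-programming computation is shown below; it is illustrative only and is not the OCD training procedure or the approximate recombination method of the paper under discussion.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences via dynamic programming."""
    # dp[i][j] = minimum edits to transform ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete ref[i-1]
                           dp[i][j - 1] + 1,        # insert hyp[j-1]
                           dp[i - 1][j - 1] + sub)  # substitute or match
    return dp[len(ref)][len(hyp)]

# Example: one substitution ("a" -> "the") and one insertion ("down") -> 2
print(edit_distance("a cat sat".split(), "the cat sat down".split()))
```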
“…More recently, work in [46] applied LM fusion and internal LM estimation during MWE training of an AED model to improve the N-best approximation. Work in [45] exploited a lattice structure in place of the N-best list to calculate the expected word errors. For RNN-T models, [42] applied the same N-best approximation as in AED to calculate the expected errors.…”
Section: B. MWE Training For End-to-End ASR Models (mentioning)
confidence: 99%
“…The sum is performed over all possible sequences, and P(Y|X) is the probability of a specific sequence computed from the end-to-end ASR model output. As it is intractable to enumerate all possible sequences and calculate their probabilities, a common practice widely adopted in MWE training for end-to-end ASR systems [40]-[45] is to use the N-best hypotheses to approximate the expected word errors, as shown in Eqn. (16).…”
Section: MBWE Training For TCPGen (mentioning)
confidence: 99%
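
Eqn. (16) of the cited work is not reproduced on this page; a standard form of the N-best approximation it describes, with N_X the N-best list for utterance X, Y* the reference, and WE(·,·) the word-error count (notation ours), is:

\[
\mathcal{L}_{\text{MWE}} \approx \sum_{Y \in \mathcal{N}_X} \frac{P(Y \mid X)}{\sum_{Y' \in \mathcal{N}_X} P(Y' \mid X)}\; \mathrm{WE}(Y, Y^{*})
\]

The renormalization over the N-best list replaces the intractable sum over all sequences described in the quote.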