2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003790

A Density Ratio Approach to Language Model Fusion in End-to-End Automatic Speech Recognition

Abstract: This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR). Applied to a Recurrent Neural Network Transducer (RNN-T) ASR model trained on a given domain, a matched in-domain RNN-LM, and a target domain RNN-LM, the proposed method uses Bayes' Rule to define RNN-T posteriors for the target domain, in a manner directly analogous to the classic hybrid model for ASR based on Deep Neural Networks (DNNs) or LSTMs in the H…

Cited by 79 publications (67 citation statements)
References 28 publications
“…It has been shown to be faster than time-synchronous search for the same accuracy. Density ratio (DR) LM fusion [13] is a shallow fusion technique that combines two language models: an external LM trained on a target domain corpus and a language model trained on the acoustic transcripts (source domain) only. The latter is used to subtract the effect of the intrinsic LM given by the prediction network (idea further developed in [14]).…”
Section: Training and Decoding Recipe
confidence: 99%
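The density-ratio fusion described in the statement above (add a target-domain LM score, subtract a source-domain LM score to cancel the intrinsic LM of the prediction network) can be sketched as a per-token beam-search score. This is a minimal illustration, not the paper's implementation; the function name and the weights `lam_target`/`lam_source` are assumptions.

```python
def density_ratio_score(log_p_asr, log_p_target_lm, log_p_source_lm,
                        lam_target=0.5, lam_source=0.5):
    """Hypothetical per-token score for density-ratio LM fusion.

    The source-domain LM log-probability is subtracted to cancel the
    intrinsic LM learned by the RNN-T prediction network, while the
    target-domain LM log-probability is added. The interpolation
    weights are tuning assumptions, not values from the paper.
    """
    return log_p_asr + lam_target * log_p_target_lm - lam_source * log_p_source_lm

# toy token: ASR log-prob -1.0, target LM -0.5, source LM -2.0
score = density_ratio_score(-1.0, -0.5, -2.0)  # -1.0 - 0.25 + 1.0 = -0.25
```

In a beam search this score would replace the plain ASR log-probability when ranking hypothesis extensions; plain shallow fusion corresponds to dropping the subtracted source-LM term.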
“…This led to a rapidly evolving research landscape in end-to-end modeling for ASR, with Recurrent Neural Network Transducers (RNN-T) [1] and attention-based models [2,3] being the most prominent examples. Attention-based models are excellent at handling non-monotonic alignment problems such as translation [4], whereas RNN-Ts are an ideal match for the left-to-right nature of speech [5][6][7][8][9][10][11][12][13][14][15][16][17].…”
Section: Introduction
confidence: 99%
“…However, shallow fusion does not have a clear probabilistic interpretation. McDermott et al [282] proposed a density ratio approach based on Bayes' rule. An LM is built on text transcripts from the training set which has paired speech and text data, and a second LM is built on the target domain.…”
Section: Language Model Adaptation
confidence: 99%
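The Bayes'-rule construction summarized above defines target-domain posteriors by reweighting source-domain posteriors with the ratio of target-LM to source-LM probabilities. A small sketch in log space, assuming hypothetical per-token log-probabilities over a toy vocabulary (not the paper's code):

```python
import math

def target_posterior(log_p_src_post, log_p_tgt_lm, log_p_src_lm):
    """Reweight source-domain posteriors by a target/source LM ratio.

    Computes, per token: log p_src(y|x) + log p_tgt(y) - log p_src(y),
    then renormalizes with log-sum-exp so the result is a valid
    log-posterior over the toy vocabulary.
    """
    scores = [a + b - c for a, b, c in
              zip(log_p_src_post, log_p_tgt_lm, log_p_src_lm)]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

# toy 3-token vocabulary (all values assumed for illustration)
log_post = target_posterior(
    [math.log(0.7), math.log(0.2), math.log(0.1)],  # source posteriors
    [math.log(0.1), math.log(0.6), math.log(0.3)],  # target-domain LM
    [math.log(0.5), math.log(0.3), math.log(0.2)],  # source-domain LM
)
```

The subtraction of the source-domain LM is what distinguishes this from plain shallow fusion, which would add only the target-LM term.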
“…Previous studies have improved the tail performance of an E2E ASR system by combining shallow fusion with MWER fine-tuning [3], or with a density ratio approach for LM fusion [5]. These methods incorporate extra language models during decoding, thus increasing the amount of computation.…”
Section: Introduction
confidence: 99%