Joint ASR and Language Identification Using RNN-T: An Efficient Approach to Dynamic Language Switching

Punjabi, Surabhi; Arsikere, Harish; Raeesy, Zeynab; Chandak, Chander; Bhave, Nikhil; Bansal, Ankish; Müller, Markus; Murillo, Sergio; Rastrow, Ariya; Stolcke, Andreas; Droppo, Jasha; Garimella, Sri; Maas, Roland; Hans, Mat; Mouchtaris, Athanasios; Kunzmann, Siegfried

doi:10.1109/icassp39728.2021.9413734

Cited by 13 publications

(10 citation statements)

References 21 publications


“…It is known that acoustic and linguistic information can be combined to improve LID prediction [9,12,28]. The concatenation of $e^{\mathrm{enc1}}_{r:t+r}$ and $e^{\mathrm{enc2}}_{1:t}$ allows the LID predictor to leverage such complementary information easily.…”

Section: Joining a LID Predictor With Cascaded Encoders (mentioning)

confidence: 99%

“…Our work differs from this as we predict the LIDs instead of using the oracle ones. Another body of work looks at techniques to predict LID and use the predictions in the ASR system or downstream tasks [21][22][23][24][25][26][27][28][29]. Much of this work focuses on LID predictions in a non-streaming system [23][24][25][26][30][31][32], which does not fit into our streaming ASR setup that is important due to production constraints.…”

Section: Introduction (mentioning)

confidence: 99%

“…Much of this work focuses on LID predictions in a non-streaming system [23][24][25][26][30][31][32], which does not fit into our streaming ASR setup that is important due to production constraints. Our work also differs from existing streaming work [27][28][29], as we use an RNN-T cascaded encoder model with a 0.9-second delay in the second pass, making it suitable for accurate LID predictions. Furthermore, our LID predictor is lightweight as it uses non-parametric streaming statistics pooling and increases parameters by only 0.5%.…”

Section: Introduction (mentioning)

confidence: 99%

“…Furthermore, our LID predictor is lightweight as it uses non-parametric streaming statistics pooling and increases parameters by only 0.5%. Finally, compared to previous works, our experiments were conducted on a larger scale of 9 language locales [27][28][29].…”

Section: Introduction (mentioning)

confidence: 99%


Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

Zhang, Li, Sainath, et al. 2022

Interspeech 2022


Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and is beneficial to integrate into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve lower word error rates (WERs) using second-pass decoding with longer right-context. By leveraging such differences in the right-contexts and a streaming implementation of statistics pooling, the proposed method can achieve accurate streaming LID prediction with little extra test-time cost. Experimental results on a voice search dataset with 9 language locales show that the proposed method achieves an average of 96.2% LID prediction accuracy and the same second-pass WER as that obtained by including oracle LID in the input.
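The abstract names a streaming implementation of statistics pooling but does not give its details. As a rough illustration only, the sketch below shows one generic way to pool statistics causally: it keeps running sums so that the mean and standard deviation over all frames seen so far can be emitted at every frame without learned parameters. The class name `StreamingStatsPooling` and its interface are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Tuple


class StreamingStatsPooling:
    """Hypothetical sketch: cumulative mean/std over frames seen so far."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.count = 0
        self.total = [0.0] * dim      # running sum of each feature
        self.total_sq = [0.0] * dim   # running sum of each squared feature

    def update(self, frame: List[float]) -> Tuple[List[float], List[float]]:
        """Consume one frame and return (mean, std) over frames 1..t."""
        if len(frame) != self.dim:
            raise ValueError("frame has the wrong dimension")
        self.count += 1
        for i, x in enumerate(frame):
            self.total[i] += x
            self.total_sq[i] += x * x
        mean = [s / self.count for s in self.total]
        # Population variance E[x^2] - E[x]^2, clamped against rounding error.
        var = [max(sq / self.count - m * m, 0.0)
               for sq, m in zip(self.total_sq, mean)]
        std = [math.sqrt(v) for v in var]
        return mean, std


pool = StreamingStatsPooling(dim=2)
pool.update([1.0, 2.0])
mean, std = pool.update([3.0, 2.0])
print(mean, std)
```

Because only two running sums per feature are stored, the per-frame cost is constant and no extra weights are introduced, which is consistent with the "little extra test-time cost" the abstract reports, although the authors' actual formulation may differ.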



“…The resulting E2E model can perform utterance-based multilingual ASR. The works in [4] [5] [6] [7] aim to build an E2E model that can improve code switching. While these approaches are different from each other, there are some similarities among them.…”

Section: Introduction (mentioning)

confidence: 99%

Bilingual End-to-End ASR with Byte-Level Subwords

Deng, Hsiao, Ghoshal 2022

Preprint


In this paper, we investigate how the output representation of an end-to-end neural network affects multilingual automatic speech recognition (ASR). We study different representations including character-level, byte-level, byte pair encoding (BPE), and byte-level byte pair encoding (BBPE) representations, and analyze their strengths and weaknesses. We focus on developing a single end-to-end model to support utterance-based bilingual ASR, where speakers do not alternate between two languages in a single utterance but may change languages across utterances. We conduct our experiments on English and Mandarin dictation tasks, and we find that BBPE with penalty schemes can improve utterance-based bilingual ASR performance by 2% to 5% relative even with a smaller number of outputs and fewer parameters. We conclude with analysis that indicates directions for further improving multilingual ASR.
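To make the contrast between character-level and byte-level output units concrete, the sketch below tokenizes a mixed English and Mandarin string both ways using standard UTF-8 encoding. It is an illustration under simple assumptions, not the authors' pipeline: the helper names are hypothetical, and the BPE merges and penalty schemes mentioned in the abstract are not modeled.

```python
from typing import List


def char_tokens(text: str) -> List[str]:
    """Character-level units: one token per Unicode code point."""
    return list(text)


def byte_tokens(text: str) -> List[int]:
    """Byte-level units: one token per UTF-8 byte (values 0..255)."""
    return list(text.encode("utf-8"))


def bytes_to_text(tokens: List[int]) -> str:
    """Decode byte tokens; invalid sequences become U+FFFD instead of raising."""
    return bytes(tokens).decode("utf-8", errors="replace")


sample = "hi 你好"
print(len(char_tokens(sample)))   # number of character tokens
print(len(byte_tokens(sample)))   # number of byte tokens
print(bytes_to_text(byte_tokens(sample)) == sample)
```

A byte-level vocabulary never exceeds 256 base symbols regardless of how many languages are covered, at the cost of longer sequences for scripts such as Chinese, where each character here occupies three UTF-8 bytes. A byte-level model can also emit sequences that are not valid UTF-8, which is why the decoder above uses `errors="replace"`.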


Review of Automatic Speech Recognition Systems for Ukrainian and English Language

Dumyn, Fedushko, Syerov 2024

Lecture Notes on Data Engineering and Communications Technologies
