Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

Kannan, Anjuli; Datta, Arindrima; Sainath, Tara N.; Weinstein, Eugene; Ramabhadran, Bhuvana; Wu, Yonghui; Bapna, Ankur; Chen, Zhifeng; Lee, Seungji

doi:10.21437/interspeech.2019-2858

Cited by 125 publications

(65 citation statements)

References 33 publications

“…Under the context of ASR systems, multilingual systems [2,3,4] are developed to capture and utilize this information in common to facilitate ASR systems for low-resource languages. Recently, several large-scale systems have been introduced for multilingual ASR [5]. Pratap et al. [6] introduced a massive single E2E model with up to 1 billion parameters trained on 50 languages.…”

Section: Introduction (mentioning)

confidence: 99%

Meta-Adapter: Efficient Cross-Lingual Adaptation With Meta-Learning

Hou, Wang, Gao, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Transfer learning from a multilingual model has shown favorable results on low-resource automatic speech recognition (ASR). However, full-model fine-tuning generates a separate model for every target language and is not suitable for deploying and maintaining in production. The key challenge lies in how to efficiently extend the pre-trained model with fewer parameters. In this paper, we propose to combine the adapter module with meta-learning algorithms to achieve high recognition performance under low-resource settings and improve the parameter efficiency of the model. Extensive experiments show that our methods can achieve recognition rates comparable or even superior to those of the state-of-the-art baselines on low-resource languages, especially under very-low-resource conditions, with a significantly smaller model profile.

“…[11] introduced the adapter module for parameter-efficient domain adaptation in machine translation, where only a few parameters are introduced for each target domain. In [5,12], the authors used language-specific adapters to enhance the performance on each language for a multilingual ASR model.…”

Section: Introduction (mentioning)

confidence: 99%

Meta-Adapter: Efficient Cross-Lingual Adaptation With Meta-Learning

Hou, Wang, Gao, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

“…With the exception of a few recent works [5,6,7], most previous work on multilingual speech recognition focuses on the benefits of these models for lower-resource or related languages. Nevertheless, in order for these models to be utilized in real-world scenarios and replace their monolingual counterparts, they need to target a variety of languages, with large […] • Introduction of an informed mixture-of-experts layer, used in the encoder of an RNN-T model, where each expert is assigned to one language, or set of related languages.…”

Section: Introduction (mentioning)

confidence: 99%

“…A variety of approaches have been explored for changing the structure of the neural network model to make it more amenable to multilingual modeling. In the context of encoder-decoder models, [5] used adapter layers to account for different amounts of available data per language, [9] parameterized the attention heads of a Transformer-based encoder to be per-language, while [6] showed that a multi-decoder multilingual model, where each decoder is assigned to a cluster of languages, can achieve good performance. Since the introduction of Mixture of Experts (MOE) in [10], these models have found popularity in machine translation [8], and speech recognition [11,12].…”

Section: Introduction (mentioning)

confidence: 99%

Mixture of Informed Experts for Multilingual Speech Recognition

Gaur, Farris, Haghani, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

When trained on related or low-resource languages, multilingual speech recognition models often outperform their monolingual counterparts. However, these models can suffer from loss in performance for high resource or unrelated languages. We investigate the use of a mixture-of-experts approach to assign per-language parameters in the model to increase network capacity in a structured fashion. We introduce a novel variant of this approach, 'informed experts', which attempts to tackle inter-task conflicts by eliminating gradients from other tasks in these task-specific parameters. We conduct experiments on a real-world task with English, French and four dialects of Arabic to show the effectiveness of our approach. Our model matches or outperforms the monolingual models for almost all languages, with gains of as much as 31% relative. Our model also outperforms the baseline multilingual model for all languages by up to 9% relative.

Index Terms: end-to-end speech recognition, multilingual, RNN-T, language id, mixture of experts

* The first two authors have equal contribution. The rest of the list is sorted alphabetically.

…variations in amounts of training data. In this paper, we propose one multilingual model to transcribe languages with varied amounts of training data. We use the mixture-of-experts approach (MOE) [8] and adapt it to exploit the inherent structure of the data to simultaneously learn per-language experts.

“…RNN-T overcomes the conditional independence assumption of CTC with the prediction network; moreover, it allows streaming ASR because it still performs frame-level monotonic decoding. Hence, there has been a significant research effort in promoting this approach in the ASR community [22,21,25,26,27], and RNN-T has recently been successfully deployed in embedded devices [28].…”

Section: Introduction (mentioning)

confidence: 99%

Exploring Pre-Training with Alignments for RNN Transducer Based End-to-End Speech Recognition

Zhao et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research because it supports online streaming speech recognition. However, RNN-T training is made difficult by its huge memory requirements and complicated neural structure. A common solution to ease RNN-T training is to employ a connectionist temporal classification (CTC) model along with an RNN language model (RNNLM) to initialize the RNN-T parameters. In this work, we instead leverage external alignments to seed the RNN-T model. Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively. Evaluated on Microsoft 65,000 hours of anonymized production data with personally identifiable information removed, our proposed methods obtain significant improvement. In particular, the encoder pre-training solution achieved a 10% and an 8% relative word error rate reduction when compared with random initialization and the widely used CTC+RNNLM initialization strategy, respectively. Our solutions also significantly reduce the RNN-T model latency from the baseline.
