ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746594
Massively Multilingual ASR: A Lifelong Learning Solution

Cited by 25 publications (12 citation statements) | References 24 publications
“…Although WPMs are more commonly adopted than graphemes for monolingual ASR, an output layer with multilingual WPMs generated by pooling all monolingual data together can often be overly large when many languages and writing scripts are integrated [28,29]. Separate monolingual output layers can be used as a solution, an idea that dates back to earlier work with phonemes and graphemes [30,31].…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
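To make the size trade-off in that statement concrete, here is a minimal PyTorch sketch, not taken from the paper: the encoder dimension, vocabulary sizes, and language codes are hypothetical. It contrasts a single pooled multilingual WPM output layer with separate monolingual heads on a shared encoder.

```python
# Minimal sketch (hypothetical sizes): a pooled multilingual WPM
# output layer vs. separate monolingual output layers.
import torch
import torch.nn as nn

ENCODER_DIM = 512

# Hypothetical per-language WPM inventory sizes.
wpm_vocab_sizes = {"en": 4096, "zh": 8192, "hi": 4096, "ru": 4096}

# Pooled design: one softmax over the union of all WPM inventories.
# Its width grows with every added language and writing script.
pooled_output = nn.Linear(ENCODER_DIM, sum(wpm_vocab_sizes.values()))

# Separate design: one small head per language on a shared encoder,
# selected by language ID at run time.
monolingual_heads = nn.ModuleDict({
    lang: nn.Linear(ENCODER_DIM, size)
    for lang, size in wpm_vocab_sizes.items()
})

def logits(lang: str, frames: torch.Tensor) -> torch.Tensor:
    """Route shared encoder frames through the head for `lang`."""
    return monolingual_heads[lang](frames)
```

With these invented numbers the pooled softmax spans 20,480 units, while no monolingual head exceeds 8,192; the per-language design keeps each softmax small at the cost of routing by language ID.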
“…First, the UML allows multilingual ASR to scale gracefully to any number of languages without increasing the output layer size [28,29]. It is smaller than the conventional multilingual output layer and improves computational efficiency in both RNN-T training and decoding.…”
Section: A Universal Monolingual Output Layer (citation type: mentioning)
Confidence: 99%
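The scaling property this quote describes can be sketched as follows. This is an illustration of the idea only, not the paper's implementation: the unit count and the token-to-unit mappings are invented.

```python
# Sketch of a universal monolingual output layer (UML): a single
# fixed-width output layer whose units are reused by every language.
import torch.nn as nn

ENCODER_DIM = 512
UNIVERSAL_UNITS = 4096  # fixed; does not grow with the language count

universal_output = nn.Linear(ENCODER_DIM, UNIVERSAL_UNITS)

# Adding a language adds only a lookup table that maps its own WPM
# inventory onto the shared unit IDs, not new output parameters.
wpm_to_unit = {
    "en": {"_the": 17, "_and": 42},  # hypothetical mappings
    "zh": {"你": 17, "好": 42},       # the same units, reused
}
```

Decoding would then interpret unit IDs through the active language's table, so the softmax stays at a fixed width no matter how many languages are added.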
“…As no label information is needed, this approach can easily scale up to more diverse speech data without any human transcription effort. With supervised multitask learning similar to [13], different tasks are unified into a single heterogeneous discriminative task and the model is trained jointly on these tasks, such as multi-domain tasks [5,18] or multilingual tasks [19,20]. A prerequisite of this approach is having some labeled data for the tasks that the FMs are trained on.…”
Section: Introduction (citation type: mentioning)
Confidence: 99%
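As a rough illustration of such joint training over heterogeneous tasks, the sketch below shares one encoder across task-specific heads and sums weighted per-task losses. The module names, the use of CTC as the criterion, and all sizes are assumptions for the sketch, not details of the cited systems.

```python
# Sketch of heterogeneous multitask training: a shared encoder,
# task-specific output heads, and a weighted sum of per-task losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultitaskASR(nn.Module):
    def __init__(self, task_vocab_sizes, feat_dim=80, enc_dim=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, enc_dim, batch_first=True)
        self.heads = nn.ModuleDict({
            task: nn.Linear(enc_dim, vocab)
            for task, vocab in task_vocab_sizes.items()
        })

    def forward(self, feats, task):
        frames, _ = self.encoder(feats)
        return self.heads[task](frames)

def joint_step(model, batches, weights):
    """One step over mixed batches: sum weighted per-task CTC losses."""
    total = 0.0
    for task, (feats, targets, target_lens) in batches.items():
        log_probs = model(feats, task).log_softmax(-1)   # (B, T, V)
        input_lens = torch.full((feats.size(0),), log_probs.size(1),
                                dtype=torch.long)
        # CTC is used purely for illustration; the cited systems may
        # use RNN-T or other discriminative criteria.
        loss = F.ctc_loss(log_probs.transpose(0, 1), targets,
                          input_lens, target_lens)
        total = total + weights[task] * loss
    return total
```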
“…Recent advances [1,2,3,4,5,6] in developing large-scale ASR architectures have demonstrated promising results for English speech recognition tasks. Moreover, English ASR models with self-supervised training objectives, such as wav2vec2 [7], w2v-BERT [8], and BigSSL [9], further boost recognition performance, extending the existing supervised ASR framework built on annotated data.…”
Section: Introduction (citation type: mentioning)
Confidence: 99%