2020
DOI: 10.48550/arxiv.2006.13979
Preprint

Unsupervised Cross-lingual Representation Learning for Speech Recognition

Cited by 82 publications (163 citation statements)
References 0 publications
“…JUST's improvement over E3 validates the effectiveness of our architecture and joint training scheme. If we exclude English WER and compare other languages as in XLSR [16], JUST outperforms monolingual, XLSR-53, B0, E3 by 36.5%, 31.1%, 19.8%, 11.0% respectively. Compared to JUST with β = 0, JUST with joint training improves the average WER (w/o en) by 7.7% (8.8%).…”
Section: Results (mentioning)
confidence: 99%
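The percentages quoted in this citation statement are relative WER reductions. As a quick illustration of how such a figure is computed, here is a minimal sketch; the WER values in it are made up purely for the example and do not come from the cited papers.

```python
# Worked example of a relative WER improvement; the numbers are illustrative only.
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative reduction in WER, as a percentage of the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# e.g. a baseline averaging 10.0% WER against a new model averaging 6.9% WER
print(relative_wer_reduction(10.0, 6.9))  # -> 31.0, i.e. a 31.0% relative improvement
```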
“…Using LM improves the monolingual performance. XLSR [16] pretrains a w2v2 on 53 languages from MLS, CommonVoice and BABEL, and finetunes the model on MLS. XLSR finetuned on the full set of MLS can outperform some low-resource monolingual baselines like it, pt, pl, but not all (Table 1).…”
Section: Compared Methods (mentioning)
confidence: 99%
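The pretrain-then-finetune workflow described in this statement (a multilingually pretrained wav2vec 2.0 encoder fine-tuned with CTC on one target corpus) can be sketched as follows. This is a minimal illustration assuming the Hugging Face transformers port and the public facebook/wav2vec2-large-xlsr-53 checkpoint; the vocabulary size, pad token id, and dummy inputs are placeholders, not values from the cited work.

```python
# Sketch: load the multilingually pretrained XLSR-53 encoder and attach a
# freshly initialized CTC head for fine-tuning on a single target language.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=0,   # placeholder: normally tokenizer.pad_token_id
    vocab_size=32,    # placeholder: size of the target-language character vocabulary
)
model.freeze_feature_encoder()  # common choice: keep the convolutional feature encoder fixed

# One illustrative training step on dummy data (1 second of 16 kHz audio, integer labels).
waveform = torch.randn(1, 16000)
labels = torch.randint(low=1, high=32, size=(1, 12))
out = model(input_values=waveform, labels=labels)
out.loss.backward()
```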
“…Since our AD dataset is in Chinese, the cross-lingual speech representation (XLSR) version of wav2vec2.0 [38] is used in this study. After self-supervised training, a randomly initialized projection layer is added on top of wav2vec2.0 for supervised ASR training.…”
Section: Leveraging the Wav2vec2.0 Based ASR Model (mentioning)
confidence: 99%
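The "randomly initialized projection layer" mentioned here is simply a linear layer mapping frame-level encoder states to the output vocabulary, trained with a CTC objective. Below is a minimal, self-contained sketch of that head; the hidden size, vocabulary size, and dummy tensors are illustrative placeholders, not details of the cited study.

```python
# Sketch: a randomly initialized projection layer on top of pretrained
# wav2vec 2.0 hidden states, trained with CTC for supervised ASR.
import torch
import torch.nn as nn

class ASRProjectionHead(nn.Module):
    def __init__(self, hidden_dim: int = 1024, vocab_size: int = 4000):
        super().__init__()
        # Fresh (randomly initialized) linear projection to the output vocabulary.
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_dim) from the pretrained encoder.
        return self.proj(hidden_states).log_softmax(dim=-1)

# Usage with dummy encoder outputs and a CTC loss.
head = ASRProjectionHead()
hidden = torch.randn(2, 100, 1024)            # stand-in for wav2vec 2.0 outputs
log_probs = head(hidden).transpose(0, 1)      # CTCLoss expects (frames, batch, vocab)
targets = torch.randint(1, 4000, (2, 20))
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 100, dtype=torch.long),
           target_lengths=torch.full((2,), 20, dtype=torch.long))
```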
“…Finally, multilingual ASR methods include simultaneous training on multiple languages [37][38][39][40] and cascaded approaches in which representations learned from one language are used as initialization for other languages [41,42]. Our approach is similar to the cascaded methods, but it only requires audio-visual data without transcripts.…”
Section: Related Work (mentioning)
confidence: 99%
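The "cascaded" idea referenced in this statement amounts to reusing weights learned on one language to initialize a model for another, typically re-initializing only the language-specific output layer. A minimal sketch of that weight transfer is given below; the architecture, dimensions, and vocabulary sizes are entirely illustrative.

```python
# Sketch: cascaded cross-lingual initialization. Copy every parameter whose name
# and shape match from a source-language model; mismatched layers (e.g. the
# output layer with a different vocabulary) keep their random initialization.
import torch.nn as nn

def build_model(vocab_size: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(80, 256), nn.ReLU(),   # stand-in for an acoustic encoder
        nn.Linear(256, vocab_size),      # language-specific output layer
    )

source = build_model(vocab_size=100)     # pretend this was trained on language A
target = build_model(vocab_size=150)     # new model for language B

target_state = target.state_dict()
for name, weight in source.state_dict().items():
    if name in target_state and target_state[name].shape == weight.shape:
        target_state[name] = weight
target.load_state_dict(target_state)
```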