2020
DOI: 10.48550/arxiv.2006.13979
Preprint

Unsupervised Cross-lingual Representation Learning for Speech Recognition

Cited by 82 publications (163 citation statements)
References 0 publications
“…JUST's improvement over E3 validates the effectiveness of our architecture and joint training scheme. If we exclude English WER and compare other languages as in XLSR [16], JUST outperforms monolingual, XLSR-53, B0, E3 by 36.5%, 31.1%, 19.8%, 11.0% respectively. Compared to JUST with β = 0, JUST with joint training improves the average WER (w/o en) by 7.7% (8.8%).…”
Section: Results (mentioning)
confidence: 99%
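The percentages quoted in this citation statement are relative WER reductions. As a quick illustration of how such a figure is computed, here is a minimal sketch; the WER values in it are made up purely for the example and do not come from the cited papers.

```python
# Worked example of a relative WER improvement; the numbers are illustrative only.
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative reduction in WER, as a percentage of the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# e.g. a baseline averaging 10.0% WER against a new model averaging 6.9% WER
print(relative_wer_reduction(10.0, 6.9))  # -> 31.0, i.e. a 31.0% relative improvement
```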
“…Using LM improves the monolingual performance. XLSR [16] pretrains a w2v2 on 53 languages from MLS, CommonVoice and BABEL, and finetunes the model on MLS. XLSR finetuned on the full set of MLS can outperform some low-resource monolingual baselines like it, pt, pl, but not all (Table 1).…”
Section: Compared Methods (mentioning)
confidence: 99%
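The pretrain-then-finetune workflow described in this statement (a multilingually pretrained wav2vec 2.0 encoder fine-tuned with CTC on one target corpus) can be sketched as follows. This is a minimal illustration assuming the Hugging Face transformers port and the public facebook/wav2vec2-large-xlsr-53 checkpoint; the vocabulary size, pad token id, and dummy inputs are placeholders, not values from the cited work.

```python
# Sketch: load the multilingually pretrained XLSR-53 encoder and attach a
# freshly initialized CTC head for fine-tuning on a single target language.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=0,   # placeholder: normally tokenizer.pad_token_id
    vocab_size=32,    # placeholder: size of the target-language character vocabulary
)
model.freeze_feature_encoder()  # common choice: keep the convolutional feature encoder fixed

# One illustrative training step on dummy data (1 second of 16 kHz audio, integer labels).
waveform = torch.randn(1, 16000)
labels = torch.randint(low=1, high=32, size=(1, 12))
out = model(input_values=waveform, labels=labels)
out.loss.backward()
```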
“…Since our AD dataset is in Chinese, the cross-lingual speech representation (XLSR) version of wav2vec2.0 [38] is used in this study. After self-supervised training, a randomly initialized projection layer is added on top of wav2vec2.0 for supervised ASR training.…”
Section: Leveraging the Wav2vec2.0 Based ASR Model (mentioning)
confidence: 99%
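The "randomly initialized projection layer" mentioned here is simply a linear layer mapping frame-level encoder states to the output vocabulary, trained with a CTC objective. Below is a minimal, self-contained sketch of that head; the hidden size, vocabulary size, and dummy tensors are illustrative placeholders, not details of the cited study.

```python
# Sketch: a randomly initialized projection layer on top of pretrained
# wav2vec 2.0 hidden states, trained with CTC for supervised ASR.
import torch
import torch.nn as nn

class ASRProjectionHead(nn.Module):
    def __init__(self, hidden_dim: int = 1024, vocab_size: int = 4000):
        super().__init__()
        # Fresh (randomly initialized) linear projection to the output vocabulary.
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_dim) from the pretrained encoder.
        return self.proj(hidden_states).log_softmax(dim=-1)

# Usage with dummy encoder outputs and a CTC loss.
head = ASRProjectionHead()
hidden = torch.randn(2, 100, 1024)            # stand-in for wav2vec 2.0 outputs
log_probs = head(hidden).transpose(0, 1)      # CTCLoss expects (frames, batch, vocab)
targets = torch.randint(1, 4000, (2, 20))
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 100, dtype=torch.long),
           target_lengths=torch.full((2,), 20, dtype=torch.long))
```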
“…Finally, multilingual ASR methods include simultaneous training on multiple languages [37][38][39][40] and cascaded approaches in which representations learned from one language are used as initialization for other languages [41,42]. Our approach is similar to the cascaded methods, but it only requires audio-visual data without transcripts.…”
Section: Related Work (mentioning)
confidence: 99%
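The "cascaded" idea referenced in this statement amounts to reusing weights learned on one language to initialize a model for another, typically re-initializing only the language-specific output layer. A minimal sketch of that weight transfer is given below; the architecture, dimensions, and vocabulary sizes are entirely illustrative.

```python
# Sketch: cascaded cross-lingual initialization. Copy every parameter whose name
# and shape match from a source-language model; mismatched layers (e.g. the
# output layer with a different vocabulary) keep their random initialization.
import torch.nn as nn

def build_model(vocab_size: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(80, 256), nn.ReLU(),   # stand-in for an acoustic encoder
        nn.Linear(256, vocab_size),      # language-specific output layer
    )

source = build_model(vocab_size=100)     # pretend this was trained on language A
target = build_model(vocab_size=150)     # new model for language B

target_state = target.state_dict()
for name, weight in source.state_dict().items():
    if name in target_state and target_state[name].shape == weight.shape:
        target_state[name] = weight
target.load_state_dict(target_state)
```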