2021
DOI: 10.48550/arxiv.2101.00390
Preprint

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Abstract: We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We provide speech recognition baselines and validate the versatility of VoxPopuli unlabelled data in semi-supervised learning…
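For readers who want to inspect the released data, the following is a minimal sketch of streaming a few VoxPopuli utterances in Python. It assumes the corpus is mirrored on the Hugging Face Hub under the facebook/voxpopuli identifier with per-language configurations (e.g. "en") and that the datasets library is installed; field names may differ between mirrors, so the snippet only inspects the available keys and the audio sampling rate.

# A minimal sketch for peeking at VoxPopuli via the `datasets` library.
# Assumption: the corpus is available on the Hugging Face Hub as
# "facebook/voxpopuli" with per-language configs such as "en".
from datasets import load_dataset

# Stream instead of downloading the whole split up front.
ds = load_dataset("facebook/voxpopuli", "en", split="train", streaming=True)

for i, sample in enumerate(ds):
    audio = sample["audio"]  # standard HF audio feature: {"array", "sampling_rate", ...}
    print(i, audio["sampling_rate"], sorted(sample.keys()))
    if i == 2:  # peek at just a few streamed examples
        break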

Cited by 26 publications (33 citation statements) · References 34 publications (37 reference statements)
“…Our studies are performed under a controlled setup where the target is clean and synthetic speech with only one female speaker, which is a common setup that previous work in the field used. With the recent release of a large-scale S2S dataset [37], we plan to investigate the proposed framework with real data in the future. Another important aspect in generating speech output is the voice and prosody.…”
Section: Discussion (mentioning)
confidence: 99%
“…Transferring to Out-of-domain Data We evaluate W2V2 and SEW-D pre-trained models on three additional ASR datasets: TED-LIUM 3 (CC BY-NC-ND 3.0) (Hernandez et al., 2018), VoxPopuli (CC0, CC BY-NC 4.0) (Wang et al., 2021a), and Fisher+Switchboard (LDC200{4,5}S13, LDC200{4,5}T19, LDC97S62) (Godfrey and Holliman, 1993) with a similar setup to Hsu et al. (2021) (see Appendix B). We use only 10h of supervised audio to stress test low-resource domain transfer.…”
Section: Comparison To Published Results (mentioning)
confidence: 99%
“…1 shows the performance-efficiency trade-offs with various model sizes. SEW-D outperforms W2V2 in most pre-training settings, when experimenting with LibriSpeech (Panayotov et al., 2015), TED-LIUM 3 (Hernandez et al., 2018), VoxPopuli (Wang et al., 2021a), and Switchboard (Godfrey and Holliman, 1993) datasets. Pre-trained models and code are available at https://github.com/asappresearch/sew.…”
Section: Introduction (mentioning)
confidence: 99%
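The wav2vec 2.0 (W2V2) evaluations quoted above ultimately reduce to running a CTC-fine-tuned model over out-of-domain audio. Below is a hedged sketch of that inference step with the transformers library; the checkpoint name (facebook/wav2vec2-base-960h) and the 16 kHz mono-input assumption are illustrative stand-ins, not details taken from the cited papers.

# Sketch of greedy CTC inference with a fine-tuned wav2vec 2.0 checkpoint.
# The checkpoint below is illustrative; the cited work fine-tunes its own
# W2V2/SEW-D models on 10h of in-domain supervised audio.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

def transcribe(waveform, sampling_rate=16_000):
    """Greedy CTC decoding of a 1-D mono waveform (float array)."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]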
“…While Caubrière et al. (2020) and Yadav et al. (2020) have shown that E2E models can outperform pipeline approaches in a fully supervised setting, they do not account for improvements in both speech and NLP from self-supervised pre-training and semi-supervised approaches. Shon et al. (2021) have introduced new speech NER annotations for the public VoxPopuli corpus (Wang et al., 2021a) and show that E2E models still do not rival pipeline approaches when state-of-the-art pre-trained models such as DeBERTa (He et al., 2020) and wav2vec 2.0 (Baevski et al., 2020) are used. However, their E2E speech NER models are at a disadvantage, since both their E2E and pipeline models use the same pre-trained speech representations while the pipeline also has access to a text model trained on 78GB of text.…”
Section: Spoken Named Entity Recognition (mentioning)
confidence: 99%