2021
DOI: 10.48550/arxiv.2101.00390
Preprint

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Abstract: We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We provide speech recognition baselines and validate the versatility of VoxPopuli unlabelled data in semi-supervised learning…
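For readers who want to inspect the released data, the following is a minimal sketch of streaming a few VoxPopuli utterances in Python. It assumes the corpus is mirrored on the Hugging Face Hub under the facebook/voxpopuli identifier with per-language configurations (e.g. "en") and that the datasets library is installed; field names may differ between mirrors, so the snippet only inspects the available keys and the audio sampling rate.

# A minimal sketch for peeking at VoxPopuli via the `datasets` library.
# Assumption: the corpus is available on the Hugging Face Hub as
# "facebook/voxpopuli" with per-language configs such as "en".
from datasets import load_dataset

# Stream instead of downloading the whole split up front.
ds = load_dataset("facebook/voxpopuli", "en", split="train", streaming=True)

for i, sample in enumerate(ds):
    audio = sample["audio"]  # standard HF audio feature: {"array", "sampling_rate", ...}
    print(i, audio["sampling_rate"], sorted(sample.keys()))
    if i == 2:  # peek at just a few streamed examples
        break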

Cited by 26 publications (33 citation statements) · References 34 publications (37 reference statements)
“…Our studies are performed under a controlled setup where the target is clean and synthetic speech with only one female speaker, which is a common setup that previous work in the field used. With the recent release of a large-scale S2S dataset [37], we plan to investigate the proposed framework with real data in the future. Another important aspect in generating speech output is the voice and prosody.…”
Section: Discussion (mentioning)
confidence: 99%
“…Transferring to Out-of-domain Data We evaluate W2V2 and SEW-D pre-trained models on three additional ASR datasets: TED-LIUM 3 (CC BY-NC-ND 3.0) (Hernandez et al., 2018), VoxPopuli (CC0, CC BY-NC 4.0) (Wang et al., 2021a), and Fisher+Switchboard (LDC200{4,5}S13, LDC200{4,5}T19, LDC97S62) (Godfrey and Holliman, 1993) with a similar setup to Hsu et al. (2021) (see Appendix B). We use only 10h of supervised audio to stress test low-resource domain transfer.…”
Section: Comparison To Published Results (mentioning)
confidence: 99%
“…1 shows the performance-efficiency trade-offs with various model sizes. SEW-D outperforms W2V2 in most pre-training settings, when experimenting with LibriSpeech (Panayotov et al., 2015), TED-LIUM 3 (Hernandez et al., 2018), VoxPopuli (Wang et al., 2021a), and Switchboard (Godfrey and Holliman, 1993) datasets. Pre-trained models and code are available at https://github.com/asappresearch/sew.…”
Section: Introduction (mentioning)
confidence: 99%
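The wav2vec 2.0 (W2V2) evaluations quoted above ultimately reduce to running a CTC-fine-tuned model over out-of-domain audio. Below is a hedged sketch of that inference step with the transformers library; the checkpoint name (facebook/wav2vec2-base-960h) and the 16 kHz mono-input assumption are illustrative stand-ins, not details taken from the cited papers.

# Sketch of greedy CTC inference with a fine-tuned wav2vec 2.0 checkpoint.
# The checkpoint below is illustrative; the cited work fine-tunes its own
# W2V2/SEW-D models on 10h of in-domain supervised audio.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

def transcribe(waveform, sampling_rate=16_000):
    """Greedy CTC decoding of a 1-D mono waveform (float array)."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]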
“…While Caubrière et al. (2020) and Yadav et al. (2020) have shown that E2E models can outperform pipeline approaches in a fully supervised setting, they do not account for improvements in both speech and NLP from self-supervised pre-training and semi-supervised approaches. Shon et al. (2021) have introduced new speech NER annotations for the public VoxPopuli corpus (Wang et al., 2021a) and show that E2E models still do not rival pipeline approaches when state-of-the-art pre-trained models such as DeBERTa (He et al., 2020) and wav2vec 2.0 (Baevski et al., 2020) are used. However, their E2E speech NER models are at a disadvantage, since both their E2E and pipeline models use the same pre-trained speech representations while the pipeline also has access to a text model trained on 78GB of text.…”
Section: Spoken Named Entity Recognition (mentioning)
confidence: 99%