Interspeech 2020
DOI: 10.21437/interspeech.2020-1835

Investigating Self-Supervised Pre-Training for End-to-End Speech Translation

Abstract: Self-supervised learning from raw speech has proven beneficial for improving automatic speech recognition (ASR). We investigate here its impact on end-to-end automatic speech translation (AST) performance. We use a contrastive predictive coding (CPC) model pre-trained on unlabeled speech as a feature extractor for a downstream AST task. We show that self-supervised pre-training is particularly efficient in low-resource settings and that fine-tuning CPC models on the AST training data further improves performance…
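
The pipeline the abstract describes (pre-train a CPC model on unlabeled speech, then use it as a feature extractor for a downstream AST model) can be sketched compactly. The PyTorch sketch below is illustrative only: the module names, layer sizes, and the frozen-encoder choice are assumptions for exposition, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CPCEncoder(nn.Module):
    """CPC-style encoder: a strided conv front-end over the raw waveform
    followed by an autoregressive context network (here a GRU)."""
    def __init__(self, hidden=256):
        super().__init__()
        # conv stack downsamples raw audio into latent frames z_t
        self.conv = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=8, stride=4), nn.ReLU(),
        )
        # autoregressive model summarizes z_{<=t} into a context vector c_t
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, wav):                 # wav: (batch, samples)
        z = self.conv(wav.unsqueeze(1))     # (batch, hidden, frames)
        c, _ = self.rnn(z.transpose(1, 2))  # (batch, frames, hidden)
        return c

class ASTModel(nn.Module):
    """Downstream AST model: pre-trained CPC features feed a small
    Transformer encoder-decoder that emits target-language tokens."""
    def __init__(self, encoder, vocab=8000, hidden=256, freeze=True):
        super().__init__()
        self.encoder = encoder
        if freeze:  # low-resource setting: keep the CPC weights fixed
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.seq2seq = nn.Transformer(d_model=hidden, nhead=4,
                                      num_encoder_layers=2,
                                      num_decoder_layers=2,
                                      batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, wav, tgt_tokens):
        feats = self.encoder(wav)              # CPC features as source input
        dec = self.seq2seq(feats, self.embed(tgt_tokens))
        return self.out(dec)                   # (batch, tgt_len, vocab)

# toy forward pass: 1 second of 16 kHz audio, 12 target tokens
model = ASTModel(CPCEncoder())
logits = model(torch.randn(2, 16000), torch.randint(0, 8000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 8000])
```

Unfreezing the encoder (freeze=False) corresponds to the fine-tuning variant the abstract reports as giving further gains.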

Cited by 29 publications (24 citation statements) · References 24 publications

“…Self-supervised pre-training for speech: In speech, wav2vec (Schneider et al., 2019) leverages contrastive learning to produce contextual representations for audio input; vq-wav2vec (Baevski et al., 2020a) and wav2vec 2.0 (Baevski et al., 2020b) further propose to discretize the original continuous audio signals in order to enable more efficient MLM training with Transformer (Vaswani et al., 2017). Pre-trained speech models have been applied to ASR (Baevski et al., 2020b), phoneme recognition (Song et al., 2020; Liu et al., 2020a), speech translation (Nguyen et al., 2020; Chung et al., 2019c), and speech synthesis (Chung et al., 2019b), to name a few.…”
Section: Related Work
Confidence: 99%
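
The contrastive objective this excerpt refers to (shared, in spirit, by CPC and wav2vec) reduces to classifying the true future frame against distractors. Below is a minimal sketch of such an InfoNCE-style loss, assuming in-batch negatives and illustrative variable names; it is not code from any of the cited systems.

```python
import torch
import torch.nn.functional as F

def info_nce(context, future, temperature=0.1):
    """Simplified CPC/wav2vec-style contrastive loss.

    context: (batch, dim) context vectors c_t
    future:  (batch, dim) encodings z_{t+k} of the true future frames
    Each context must score its own future higher than the other
    futures in the batch, which serve as negatives (distractors).
    """
    context = F.normalize(context, dim=-1)
    future = F.normalize(future, dim=-1)
    logits = context @ future.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(len(context))         # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```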
“…As a variant of transfer learning, self-supervised learning of speech or language representations has been proposed in recent years, for instance with the BERT system [13], used for textual representation. Such representations, computed by neural models trained on huge amounts of unlabeled data, have shown their effectiveness on some tasks under certain conditions, for instance for computer vision [14] and Natural Language Processing (NLP) tasks such as single-sentence classification, text similarity, and relevance ranking [15, 16, 17], as well as ASR [18, 19] and speech translation [20].…”
Section: Related Work
Confidence: 99%
“…From the results of Tables 1 and 2… In addition to the experiments from Table 1, we generated two ensemble models [10] using only wav2vec features. The improved results of combining models 3+4 and models 7+8 are shown in Table 3.…”
Section: Results
Confidence: 99%
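
Model ensembling as used in this excerpt is typically implemented by averaging the models' per-step output distributions during decoding. A minimal sketch under that assumption follows (a generic probability average, not necessarily the cited paper's exact combination rule).

```python
import torch

def ensemble_step(models, inputs):
    """Average the per-step output distributions of several models.

    models: sequence of modules mapping inputs -> logits (batch, vocab)
    Returns averaged log-probabilities, suitable for beam search.
    """
    probs = [m(inputs).softmax(dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0).log()

# toy usage with two linear "models" over a 10-word vocabulary
m1, m2 = torch.nn.Linear(4, 10), torch.nn.Linear(4, 10)
log_probs = ensemble_step([m1, m2], torch.randn(3, 4))
print(log_probs.shape)  # torch.Size([3, 10])
```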
“…In NLP, BERT [17], UniLM [18], and BART [19] have been successfully applied to language inference [20], question answering [21], and summarization [22]. In speech processing, APC [1], wav2vec [16], vq-wav2vec [14], wav2vec 2.0 [15], Mockingjay [11], Tera [13], and Audio ALBERT [4] have been applied to tasks such as phoneme recognition, automatic speech recognition, speaker recognition, sentiment analysis, and speech-to-text translation [10]. However, to the best of our knowledge, none of these self-supervised speech pre-training techniques have been applied to the field of SLU.…”
Section: Introduction
Confidence: 99%