ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414227

Joint Masked CPC And CTC Training For ASR

Abstract: Self-supervised learning (SSL) has shown promise in learning representations of audio that are useful for automatic speech recognition (ASR). But training SSL models like wav2vec 2.0 requires a two-stage pipeline. In this paper we demonstrate a single-stage training of ASR models that can utilize both unlabeled and labeled data. During training, we alternately minimize two losses: an unsupervised masked Contrastive Predictive Coding (CPC) loss and the supervised audio-to-text alignment loss Connectionist Temporal Classification (CTC). […]
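
The abstract describes a single-stage schedule that alternates between an unsupervised masked CPC loss on unlabeled audio and a supervised CTC loss on transcribed audio. The PyTorch-style sketch below only illustrates that alternation under assumed interfaces: the encoder, the masked_cpc_loss helper, and the two data loaders are hypothetical placeholders, not the paper's actual implementation.

# Minimal sketch (not the authors' code) of alternating masked-CPC and CTC
# updates within one training loop. `encoder`, `masked_cpc_loss`, and the
# data loaders are assumed, hypothetical interfaces.
import itertools
import torch
import torch.nn.functional as F

def train_joint(encoder, unlabeled_loader, labeled_loader, masked_cpc_loss,
                steps=10000, lr=1e-4, device="cpu"):
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    unlabeled = itertools.cycle(unlabeled_loader)
    labeled = itertools.cycle(labeled_loader)
    for step in range(steps):
        opt.zero_grad()
        if step % 2 == 0:
            # Unsupervised step: contrastive (masked CPC) loss on unlabeled audio.
            audio = next(unlabeled).to(device)
            loss = masked_cpc_loss(encoder, audio)
        else:
            # Supervised step: CTC alignment loss on transcribed audio.
            audio, targets, in_lens, tgt_lens = next(labeled)
            logits = encoder(audio.to(device))                          # (batch, frames, vocab)
            log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (frames, batch, vocab)
            loss = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)
        loss.backward()
        opt.step()

Because the abstract states the two losses are minimized alternately rather than summed, each optimizer step above sees exactly one objective; the even/odd schedule is just one simple way to express that alternation.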

Cited by 20 publications (4 citation statements)
References 13 publications (14 reference statements)
Citation statements (ordered by relevance):
“…Most similar to our work is the speech recognition system wav2vec2.0 (Baevski et al., 2020), which uses the same two-phase training setup with a self-supervised contrastive loss during pre-training and a Connectionist Temporal Classification (CTC) loss on transcribed speech data during fine-tuning. Talnikar et al. (2020) show that the self-supervised loss regularizes the supervised loss during joint learning of both objectives. Follow-up work has shown the usefulness of the pre-trained speech representations for exploring speech variation (Bartelds et al., 2020).…”
Section: Related Work
Confidence: 99%
“…During fine-tuning, we only update the parameters of the Transformer-based context network, following wav2vec2.0 [5]. Fine-tuning wav2vec2.0 on labeled data with the CTC objective [16] has been well verified [30,31,32]. However, according to [6], fine-tuning wav2vec2.0 with a vanilla Transformer S2S ASR model and a cross-entropy criterion achieves only limited results.…”
Section: Encoder (W2v-encoder)
Confidence: 99%
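
The statement above describes wav2vec2.0-style fine-tuning in which only the Transformer context network is updated with a CTC objective. The sketch below shows one way such selective fine-tuning could look; the attribute names feature_extractor and context_network are assumptions for illustration, not the cited model's actual API.

# Illustrative sketch (assumed module names) of CTC fine-tuning that updates
# only the Transformer context network while the convolutional feature
# extractor stays frozen, as in wav2vec2.0-style fine-tuning.
import torch
import torch.nn.functional as F

def finetune_ctc(model, labeled_loader, lr=3e-5, device="cpu"):
    for p in model.feature_extractor.parameters():     # assumed attribute name
        p.requires_grad = False                         # freeze the feature encoder
    opt = torch.optim.Adam(model.context_network.parameters(), lr=lr)  # assumed attribute name
    for audio, targets, in_lens, tgt_lens in labeled_loader:
        opt.zero_grad()
        logits = model(audio.to(device))                            # (batch, frames, vocab)
        log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (frames, batch, vocab)
        loss = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)
        loss.backward()
        opt.step()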
“…First, the models are pre-trained on thousands of hours of unlabeled speech, and then they are further adapted by fine-tuning on the actual task of automatic speech recognition (ASR) using a smaller supervised set. However, because the pre-training (PT) phase is task-agnostic, self-supervision can underperform on a specific downstream task (Talnikar et al., 2021; Dery et al., 2022). Further, SSL pre-training leads to a more complicated pipeline involving multiple phases.…”
Section: Introduction
Confidence: 99%