ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414227

Joint Masked CPC And CTC Training For ASR

Abstract: Self-supervised learning (SSL) has shown promise in learning representations of audio that are useful for automatic speech recognition (ASR). But training SSL models like wav2vec 2.0 requires a two-stage pipeline. In this paper we demonstrate a single-stage training of ASR models that can utilize both unlabeled and labeled data. During training, we alternately minimize two losses: an unsupervised masked Contrastive Predictive Coding (CPC) loss and the supervised audio-to-text alignment loss Connectionist Temporal Classification (CTC). […]
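
The abstract describes a single-stage schedule that alternates between an unsupervised masked CPC loss on unlabeled audio and a supervised CTC loss on transcribed audio. The PyTorch-style sketch below only illustrates that alternation under assumed interfaces: the encoder, the masked_cpc_loss helper, and the two data loaders are hypothetical placeholders, not the paper's actual implementation.

# Minimal sketch (not the authors' code) of alternating masked-CPC and CTC
# updates within one training loop. `encoder`, `masked_cpc_loss`, and the
# data loaders are assumed, hypothetical interfaces.
import itertools
import torch
import torch.nn.functional as F

def train_joint(encoder, unlabeled_loader, labeled_loader, masked_cpc_loss,
                steps=10000, lr=1e-4, device="cpu"):
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    unlabeled = itertools.cycle(unlabeled_loader)
    labeled = itertools.cycle(labeled_loader)
    for step in range(steps):
        opt.zero_grad()
        if step % 2 == 0:
            # Unsupervised step: contrastive (masked CPC) loss on unlabeled audio.
            audio = next(unlabeled).to(device)
            loss = masked_cpc_loss(encoder, audio)
        else:
            # Supervised step: CTC alignment loss on transcribed audio.
            audio, targets, in_lens, tgt_lens = next(labeled)
            logits = encoder(audio.to(device))                          # (batch, frames, vocab)
            log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (frames, batch, vocab)
            loss = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)
        loss.backward()
        opt.step()

Because the abstract states the two losses are minimized alternately rather than summed, each optimizer step above sees exactly one objective; the even/odd schedule is just one simple way to express that alternation.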

Cited by 20 publications (4 citation statements)
References 13 publications (14 reference statements)
Citation statements (ordered by relevance):
“…Most similar to our work is the speech recognition system wav2vec2.0 (Baevski et al., 2020), which uses the same two-phase training setup with a self-supervised contrastive loss during pre-training and a Connectionist Temporal Classification (CTC) loss on transcribed speech data during fine-tuning. Talnikar et al. (2020) show that the self-supervised loss regularizes the supervised loss during joint learning of both objectives. Follow-up work has shown the usefulness of the pre-trained speech representations for exploring speech variation (Bartelds et al., 2020).…”
Section: Related Work
Confidence: 99%
“…During fine-tuning, we only update the parameters of the Transformer-based context network, following wav2vec2.0 [5]. Fine-tuning wav2vec2.0 on labeled data with the CTC objective [16] has been well verified [30,31,32]. However, according to [6], fine-tuning wav2vec2.0 with a vanilla Transformer S2S ASR model and a cross-entropy criterion achieves only limited results.…”
Section: Encoder (W2v-encoder)
Confidence: 99%
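
The statement above describes wav2vec2.0-style fine-tuning in which only the Transformer context network is updated with a CTC objective. The sketch below shows one way such selective fine-tuning could look; the attribute names feature_extractor and context_network are assumptions for illustration, not the cited model's actual API.

# Illustrative sketch (assumed module names) of CTC fine-tuning that updates
# only the Transformer context network while the convolutional feature
# extractor stays frozen, as in wav2vec2.0-style fine-tuning.
import torch
import torch.nn.functional as F

def finetune_ctc(model, labeled_loader, lr=3e-5, device="cpu"):
    for p in model.feature_extractor.parameters():     # assumed attribute name
        p.requires_grad = False                         # freeze the feature encoder
    opt = torch.optim.Adam(model.context_network.parameters(), lr=lr)  # assumed attribute name
    for audio, targets, in_lens, tgt_lens in labeled_loader:
        opt.zero_grad()
        logits = model(audio.to(device))                            # (batch, frames, vocab)
        log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (frames, batch, vocab)
        loss = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)
        loss.backward()
        opt.step()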
“…First, the models are pre-trained on thousands of hours of unlabeled speech, and then they are further adapted by fine-tuning on the actual task of automatic speech recognition (ASR) using a smaller supervised set. However, because the pre-training (PT) phase is task-agnostic, self-supervision can underperform on a specific downstream task (Talnikar et al., 2021; Dery et al., 2022). Further, SSL pre-training leads to a more complicated pipeline involving multiple phases.…”
Section: Introduction
Confidence: 99%