Comparing CTC and LFMMI for Out-of-Domain Adaptation of wav2vec 2.0 Acoustic Model

Vyas, Apoorv; Madikeri, Srikanth; Bourlard, Hervé

doi:10.21437/interspeech.2021-1683

Cited by 11 publications

(4 citation statements)

References 11 publications

(14 reference statements)

“…However, only few studies focused on domain shift during pre-training and fine-tuning or the impact of noisy speech on AM [37]. For instance, [14,38,39] perform experiments similar to ours, addressing the domain-shift scenario between pre-training and fine-tuning phases. Yet, these databases still fall into read, spontaneous or conversational speech.…”

Section: Related Work (mentioning)

confidence: 85%

How Does Pre-Trained Wav2Vec 2.0 Perform on Domain-Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications

Zuluaga-Gómez, Prasad, Nigmatulina et al. 2023

2022 IEEE Spoken Language Technology Workshop (SLT)

Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic models (AM) that can later be fine-tuned on downstream tasks, e.g., automatic speech recognition (ASR). Yet, few works have investigated the impact on performance when the data properties differ substantially between the pre-training and fine-tuning phases, termed domain shift. We target this scenario by analyzing the robustness of Wav2Vec 2.0 and XLS-R models on downstream ASR for a completely unseen domain, air traffic control (ATC) communications. We benchmark these two models on several open-source and challenging ATC databases with signal-to-noise ratios between 5 and 20 dB. Relative word error rate (WER) reductions between 20% and 40% are obtained in comparison to hybrid-based ASR baselines by only fine-tuning E2E acoustic models with a smaller fraction of labeled data. We analyze WERs in the low-resource scenario and the gender bias carried by one ATC dataset.

“…However, only few studies focused on domain mismatch or domain shift during pre-training and fine-tuning. For instance, [12,28,29] perform experiments similar to ours, addressing the domain-shift scenario between pre-training and fine-tuning phases.…”

Section: Related Work (mentioning)

confidence: 99%

How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications

Zuluaga-Gómez, Prasad, Nigmatulina et al. 2022

Preprint

Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic models (AM) that can later be fine-tuned on downstream tasks, e.g., automatic speech recognition (ASR). Yet, few works have investigated the impact on performance when the data substantially differs between the pre-training and downstream fine-tuning phases (i.e., domain shift). We target this scenario by analyzing the robustness of Wav2Vec 2.0 and XLS-R models on downstream ASR for a completely unseen domain, i.e., air traffic control (ATC) communications. We benchmark the proposed models on four challenging ATC test sets (signal-to-noise ratio varies between 5 and 20 dB). Relative word error rate (WER) reductions between 20% and 40% are obtained in comparison to hybrid-based state-of-the-art ASR baselines by fine-tuning E2E acoustic models with a small fraction of labeled data. We also study the impact of fine-tuning data size on WERs, going from 5 minutes (few-shot) to 15 hours.

“…Architectures based on convolutional layers and Transformers [17] have been proposed and pre-trained with English datasets, such as wav2vec2.0 [1] and HuBERT [18]. For speech recognition, these pre-trained self-supervised models are usually fine-tuned on the transcribed training data with a standard supervised loss such as Connectionist Temporal Classification (CTC) loss [19], or lattice-free maximum mutual information (LF-MMI) loss [20,21]. A multilingual pre-trained model XLSR-53 [2], which is based on the wav2vec2.0 architecture [1], performs well when used to train ASR models for low-resource languages [2,22,23,24].…”

Section: Self-supervised Training (mentioning)

confidence: 99%

Comparing Self-Supervised Pre-Training and Semi-Supervised Training for Speech Recognition in Languages with Weak Language Models

Lam-Yee-Mui, Yang, Klejch 2023

Interspeech 2023

This paper investigates the potential of improving a hybrid automatic speech recognition model trained on 10 hours of transcribed data with 200 hours of untranscribed data in low-resource languages. First, we compare baseline methods of cross-lingual transfer with MFCC features and features extracted with the multilingual self-supervised model XLSR-53. Subsequently, we compare two approaches that can leverage the untranscribed data: semi-supervised training with LF-MMI and continued self-supervised pre-training of XLSR-53. Our results on well-resourced English broadcast data derived from MGB show that both methods achieve 18% and 27% relative improvements compared to the baseline, respectively. On the low-resource South African Soap Opera dataset, the relative improvement with semi-supervised training is only 3% due to the inherently weak language model. However, continued pre-training achieves 8.6% relative improvement because it does not rely on any external information.
