Interspeech 2021
DOI: 10.21437/interspeech.2021-1683

Comparing CTC and LFMMI for Out-of-Domain Adaptation of wav2vec 2.0 Acoustic Model

Abstract: In this work, we investigate whether wav2vec 2.0 self-supervised pretraining helps mitigate the overfitting issues of connectionist temporal classification (CTC) training, reducing its performance gap with flat-start lattice-free MMI (E2E-LFMMI) for automatic speech recognition with limited training data. Towards that objective, we use the pretrained wav2vec 2.0 BASE model and fine-tune it on three different datasets, including out-of-domain (Switchboard) and cross-lingual (Babel) scenarios. Our results show th…
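To make the fine-tuning objective concrete, the following is a minimal sketch of a CTC training step on top of a pretrained encoder, written in plain PyTorch. It is not the authors' exact recipe: the `encoder` argument stands in for a pretrained wav2vec 2.0 BASE model mapping raw audio to frame-level features, and the vocabulary size and linear output layer are illustrative assumptions.

import torch
import torch.nn as nn

# Illustrative sketch only (not the paper's exact setup).
VOCAB_SIZE = 32      # e.g. characters + CTC blank (assumed)
FEATURE_DIM = 768    # hidden size of wav2vec 2.0 BASE

output_layer = nn.Linear(FEATURE_DIM, VOCAB_SIZE)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_step(encoder: nn.Module,
             audio: torch.Tensor,          # (B, T_audio) raw waveforms
             targets: torch.Tensor,        # (B, L) padded label ids
             target_lengths: torch.Tensor  # (B,) true label lengths
             ) -> torch.Tensor:
    """Compute the CTC loss for one batch during fine-tuning."""
    feats = encoder(audio)                                   # (B, T_frames, D)
    logits = output_layer(feats)                             # (B, T_frames, V)
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)   # (T, B, V), as nn.CTCLoss expects
    input_lengths = torch.full((audio.size(0),), log_probs.size(0),
                               dtype=torch.long)
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)

In the E2E-LFMMI alternative compared in the paper, the same encoder outputs would instead feed a lattice-free MMI objective; that loss requires a dedicated toolkit and is not sketched here.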

Cited by 11 publications (4 citation statements); references 11 publications (14 reference statements).

“…However, only few studies focused on domain shift during pre-training and fine-tuning or the impact of noisy speech on AM [37]. For instance, [14,38,39] perform experiments similar to ours, addressing the domain-shift scenario between pre-training and fine-tuning phases. Yet, these databases still fall into read, spontaneous or conversational speech.…”
Section: Related Work
confidence: 85%
“…However, only few studies focused on domain mismatch or domain shift during pre-training and fine-tuning. For instance, [12,28,29] perform experiments similar to ours, addressing the domain-shift scenario between pre-training and fine-tuning phases.…”
Section: Related Work
confidence: 99%
“…Architectures based on convolutional layers and Transformers [17] have been proposed and pre-trained with English datasets, such as wav2vec2.0 [1] and HuBERT [18]. For speech recognition, these pre-trained self-supervised models are usually fine-tuned on the transcribed training data with a standard supervised loss such as Connectionist Temporal Classification (CTC) loss [19], or lattice-free maximum mutual information (LF-MMI) loss [20,21]. A multilingual pre-trained model XLSR-53 [2], which is based on the wav2vec2.0 architecture [1], performs well when used to train ASR models for low-resource languages [2,22,23,24].…”
Section: Self-supervised Training
confidence: 99%