ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053847
Analyzing ASR Pretraining for Low-Resource Speech-to-Text Translation

Abstract: Previous work has shown that for low-resource source languages, automatic speech-to-text translation (AST) can be improved by pretraining an end-to-end model on automatic speech recognition (ASR) data from a high-resource language. However, it is not clear what factors (e.g., language relatedness or size of the pretraining data) yield the biggest improvements, or whether pretraining can be effectively combined with other methods such as data augmentation. Here, we experiment with pretraining on datasets of varyin…

Cited by 42 publications (24 citation statements); references 23 publications.
“…To avoid this problem and for better efficiency, end-to-end ST models have been proposed and have become popular in recent years (Berard et al., 2016, 2018; Bansal et al., 2018). To alleviate the data scarcity problem of end-to-end ST models, various techniques are utilized, including pre-training (Bansal et al., 2019), multi-task learning (Anastasopoulos and Chiang, 2018), knowledge distillation (Ren et al., 2020), data synthesis (Jia et al., 2019), self-supervised learning, and speech augmentation techniques like SpecAugment (Bahar et al., 2019) or speed perturbation (Stoian et al., 2020). Some studies focus on how to bridge the gap between different modalities (speech and text) or different modules (acoustic and semantic modeling).…”
Section: Related Work
confidence: 99%
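The SpecAugment-style masking mentioned in this statement can be sketched in a few lines. The following numpy version is illustrative only (function name, parameter names, and default values are assumptions, not the cited implementation): random frequency bands and time spans of a log-mel spectrogram are overwritten with the spectrogram mean.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=8,
                 num_time_masks=2, time_mask_width=20, rng=None):
    """SpecAugment-style masking (illustrative sketch).

    spec: (num_freq_bins, num_frames) log-mel spectrogram.
    Masked regions are filled with the spectrogram mean; the input
    array is left unmodified.
    """
    rng = rng if rng is not None else np.random.default_rng()
    spec = spec.copy()
    fill = spec.mean()
    n_freq, n_time = spec.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_mask_width + 1))   # mask width in bins
        f0 = int(rng.integers(0, max(1, n_freq - w)))   # mask start bin
        spec[f0:f0 + w, :] = fill
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_mask_width + 1))   # mask width in frames
        t0 = int(rng.integers(0, max(1, n_time - w)))   # mask start frame
        spec[:, t0:t0 + w] = fill
    return spec
```

Because the masks are sampled per call, each training epoch sees a differently corrupted view of the same utterance, which is what gives the augmentation its regularizing effect.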
“…To enhance the CTC quality, we also apply a pretraining procedure (Stoian et al., 2020). We only use the CTC loss to pre-train the acoustic encoder.…”
Section: Training Procedures
confidence: 99%
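The CTC loss used here to pre-train the acoustic encoder can be illustrated with a minimal numpy sketch of the standard forward (alpha) recursion. This is a toy single-utterance version, not the cited authors' implementation, and it assumes a non-empty target sequence:

```python
import numpy as np

def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (toy sketch).

    log_probs: (T, V) array of per-frame log-probabilities.
    target: list of label ids (non-empty), blank id excluded.
    Uses the forward recursion over the blank-extended label sequence.
    """
    T, V = log_probs.shape
    # Extended sequence: blanks interleaved around the labels.
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)
    alpha = np.full(S, -np.inf)
    alpha[0] = log_probs[0, blank]    # start with a blank...
    alpha[1] = log_probs[0, ext[1]]   # ...or the first label
    for t in range(1, T):
        new = np.full(S, -np.inf)
        for s in range(S):
            cands = [alpha[s]]                     # stay on same symbol
            if s > 0:
                cands.append(alpha[s - 1])         # advance by one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])         # skip intermediate blank
            new[s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
        alpha = new
    # Valid paths end on the last label or the final blank.
    return -np.logaddexp(alpha[-1], alpha[-2])
```

For example, with two frames of uniform log-probabilities over three symbols, the paths collapsing to a single label `1` are (1,1), (blank,1), and (1,blank), so the loss equals -log(3/9) = -log(1/3).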
“…In a similar way, more recent work pre-trains different components of the ST system and consolidates them into one. For example, one can initialize the encoder with an ASR model and initialize the decoder with the target-language side of an MT model (Berard et al., 2018; Bansal et al., 2019; Stoian et al., 2020). More sophisticated methods include better training and fine-tuning (Wang et al., 2020a,b), the shrink mechanism, the adversarial regularizer (Alinejad and Sarkar, 2020), etc.…”
Section: Related Work
confidence: 99%
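The initialization scheme described in this statement (encoder from an ASR model, decoder from the target side of an MT model) amounts to a selective parameter copy at model-construction time. A minimal, framework-agnostic Python sketch, with hypothetical parameter-name prefixes:

```python
def init_from_pretrained(st_params, asr_params, mt_params):
    """Initialize ST model parameters from pretrained components (sketch).

    st_params:  freshly initialized speech-translation parameters,
                keyed by dotted names (e.g. "encoder.lstm.w").
    asr_params: pretrained ASR model parameters; its "encoder.*"
                entries overwrite the ST encoder.
    mt_params:  pretrained MT model parameters; its "decoder.*"
                entries overwrite the ST decoder.
    The name prefixes are illustrative assumptions.
    """
    init = dict(st_params)
    for name, value in asr_params.items():
        if name.startswith("encoder."):
            init[name] = value
    for name, value in mt_params.items():
        if name.startswith("decoder."):
            init[name] = value
    return init
```

Parameters with no pretrained counterpart (e.g. a new output projection) simply keep their fresh initialization, which is why this scheme composes cleanly with subsequent fine-tuning on the ST data.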
“…The system is trained only on transcribed SLT data, with two auxiliary tasks: pretraining the encoder and decoder with ASR and textual MT, respectively. Stoian et al. (2019) compare the effects of pretraining on auxiliary ASR datasets of different languages and sizes, concluding that the WER of the ASR system is more predictive of the final translation quality than language relatedness. Anastasopoulos and Chiang (2018) blur the line between pipeline and end-to-end approaches by using a multi-task learning setup with two-step decoding.…”
Section: End-to-end Spoken Language Translation
confidence: 99%