“…To avoid this problem and improve efficiency, end-to-end ST models have been proposed and have become popular in recent years (Berard et al., 2016, 2018; Bansal et al., 2018). To alleviate the data scarcity problem of end-to-end ST models, various techniques have been utilized, including pre-training (Bansal et al., 2019), multi-task learning (Anastasopoulos and Chiang, 2018), knowledge distillation (Ren et al., 2020), data synthesis (Jia et al., 2019), self-supervised learning, and speech augmentation techniques such as SpecAugment (Bahar et al., 2019) or speed perturbation (Stoian et al., 2020). Other studies focus on bridging the gap between different modalities (speech and text) or different modules (acoustic and semantic modeling).…”
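As a concrete illustration of the speech augmentation idea mentioned above, the following is a minimal SpecAugment-style sketch: random frequency and time masks are zeroed out on a log-mel spectrogram. The function name and mask widths are illustrative choices, not taken from any of the cited papers.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=8,
                 num_time_masks=2, time_mask_width=20, rng=None):
    """SpecAugment-style masking (illustrative sketch).

    spec: array of shape (num_freq_bins, num_frames), e.g. a log-mel
    spectrogram. Randomly chosen frequency bands and time spans are
    zeroed, forcing the model to rely on the remaining context.
    """
    rng = rng if rng is not None else np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    # Mask a few random frequency bands (rows).
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w + 1)))
        out[f0:f0 + w, :] = 0.0
    # Mask a few random time spans (columns).
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_mask_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w + 1)))
        out[:, t0:t0 + w] = 0.0
    return out

augmented = spec_augment(np.ones((80, 100)), rng=np.random.default_rng(0))
```

Applied on the fly during training, such masking acts as a cheap regularizer: no extra audio is needed, which is why it is attractive under the data scarcity conditions the excerpt describes.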