Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1006
Pre-training on high-resource speech recognition improves low-resource speech-to-text translation

Abstract: We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish-English ST from 10.8 to 20.2 BLEU when only 20 hours of Spanish-English ST training data are available. Through an ablation study, we find that the pre-tra…
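The recipe in the abstract is a standard transfer-learning setup, so a minimal sketch may help make it concrete. The code below is not the authors' released implementation: the SpeechSeq2Seq module, layer sizes, and file names are illustrative assumptions, and it only shows how ASR pre-training followed by ST fine-tuning can share one set of parameters.

```python
# Minimal sketch of the pre-train-then-fine-tune recipe (hypothetical module
# names and sizes, not the authors' code): pre-train a speech encoder-decoder
# on high-resource English ASR, then fine-tune the same parameters on
# low-resource Spanish-English speech translation.
import torch
import torch.nn as nn

class SpeechSeq2Seq(nn.Module):
    """Generic encoder-decoder over speech features (attention omitted for brevity)."""
    def __init__(self, feat_dim=80, hidden=256, vocab_size=10000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, prev_tokens):
        enc_out, _ = self.encoder(feats)                  # (B, T, H)
        dec_out, _ = self.decoder(self.embed(prev_tokens))
        return self.out(dec_out)                          # logits over target vocab

# 1) Pre-train on high-resource English ASR (speech -> English transcript),
#    then save the parameters.
asr_model = SpeechSeq2Seq()
# ... train asr_model on ~300 h of ASR data ...
torch.save(asr_model.state_dict(), "asr_pretrained.pt")

# 2) Fine-tune for Spanish-English ST: initialise the ST model from the ASR
#    checkpoint and continue training on the ~20 h of ST data.
#    strict=False tolerates any layers that differ between the two setups.
st_model = SpeechSeq2Seq()
st_model.load_state_dict(torch.load("asr_pretrained.pt"), strict=False)
# ... train st_model on the low-resource ST data ...
```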

Cited by 131 publications (148 citation statements).
References 34 publications (30 reference statements).
“…The common way is to use an ASR encoder and an MT decoder to initialize the corresponding parameters of the ST network [20]. Surprisingly, using an ASR model to pre-train both the encoder and the decoder of the ST model also works well [19]. Following [30], we automatically recompute the provided audio-to-source-sentence alignments to reduce the problem of speech segments without a translation.…”
Section: Pre-training
confidence: 99%
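As a rough illustration of the initialisation scheme described in the statement above, the sketch below copies encoder parameters from an ASR checkpoint and decoder parameters from an MT checkpoint into an ST model. It reuses the hypothetical SpeechSeq2Seq model from the earlier sketch; the checkpoint names and the "encoder."/"decoder." parameter-name prefixes are assumptions.

```python
# Rough sketch of "ASR encoder + MT decoder" initialisation for an ST model.
# SpeechSeq2Seq is the hypothetical model from the sketch above; checkpoint
# files and parameter-name prefixes are assumed, not taken from any paper.
import torch

st_model = SpeechSeq2Seq()
st_state = st_model.state_dict()
asr_state = torch.load("asr_pretrained.pt")   # trained on high-resource ASR
mt_state = torch.load("mt_pretrained.pt")     # trained on text-to-text MT

for name in st_state:
    if name.startswith("encoder.") and name in asr_state:
        st_state[name] = asr_state[name]      # speech encoder from the ASR model
    elif name in mt_state:
        st_state[name] = mt_state[name]       # embeddings, decoder, output layer from MT

st_model.load_state_dict(st_state)
# Alternatively, as in [19], both encoder and decoder can be taken from the ASR
# checkpoint alone, which reportedly works well despite the task mismatch.
```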
“…This large degradation led us to investigate further why pre-training the text decoder with an MT model hurts. To investigate this, we first try pre-training both the encoder and the decoder with our ASR model, as suggested in [19]. Since the ASR decoder is already familiar with the ASR encoder, this problem should disappear.…”
Section: Pre-training
confidence: 99%
“…One of the most recent and successful data augmentation methods, SpecAugment [3], modifies the spectrogram with time warping, frequency masking and time masking. AST methods to leverage ASR and MT data include pretraining [4], multitask learning [5] and weakly supervised data augmentation [6,7].…”
Section: Introduction
confidence: 99%
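The SpecAugment masking mentioned in the last statement is simple enough to sketch directly. The snippet below applies frequency and time masking to a log-mel spectrogram (time warping omitted); the function name and parameter values are placeholders, not the settings used in [3].

```python
# Illustrative SpecAugment-style masking on a log-mel spectrogram:
# zero out random bands of mel channels (frequency masks) and random
# spans of frames (time masks). Time warping is omitted for brevity.
import torch

def spec_augment(spec, freq_mask=27, time_mask=100, n_masks=2):
    """spec: (n_mels, n_frames) tensor; returns a copy with masked regions set to zero."""
    spec = spec.clone()
    n_mels, n_frames = spec.shape
    for _ in range(n_masks):
        # Frequency mask: a random band of consecutive mel channels.
        f = torch.randint(0, freq_mask + 1, (1,)).item()
        f0 = torch.randint(0, max(1, n_mels - f), (1,)).item()
        spec[f0:f0 + f, :] = 0.0
        # Time mask: a random span of consecutive frames.
        t = torch.randint(0, time_mask + 1, (1,)).item()
        t0 = torch.randint(0, max(1, n_frames - t), (1,)).item()
        spec[:, t0:t0 + t] = 0.0
    return spec

augmented = spec_augment(torch.randn(80, 1000))  # e.g. 80 mel bins, 1000 frames
```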