Interspeech 2018
DOI: 10.21437/interspeech.2018-1326

Low-Resource Speech-to-Text Translation

Abstract: Speech-to-text translation has many potential applications for low-resource languages, but the typical approach of cascading speech recognition with machine translation is often impossible, since the transcripts needed to train a speech recognizer are usually not available for low-resource languages. Recent work has found that neural encoder-decoder models can learn to directly translate foreign speech in high-resource scenarios, without the need for intermediate transcription. We investigate whether this appr…

Cited by 54 publications (61 citation statements)
References 27 publications
“…Pre-training can be done in different ways as proposed in the literature. The common way is to use an ASR encoder and an MT decoder to initialize the parameters of the ST network correspondingly [20]. Surprisingly, using an ASR model to pre-train both the encoder and the decoder of the ST model works well [19].…”
Section: Pre-training
confidence: 99%
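To make the initialization scheme described in this excerpt concrete, here is a minimal PyTorch-style sketch: the ST encoder is seeded from an ASR model and the ST decoder from an MT model. It assumes the checkpoints are plain state dicts and that parameter names share "encoder."/"decoder." prefixes; all paths and names are hypothetical, not the cited papers' code.

```python
# Sketch: initialize an ST network's encoder from a pre-trained ASR model
# and its decoder from a pre-trained MT model. Checkpoint paths and
# parameter-name prefixes are assumptions for illustration.
import torch

def init_st_from_pretrained(st_model, asr_ckpt="asr.pt", mt_ckpt="mt.pt"):
    asr_state = torch.load(asr_ckpt, map_location="cpu")
    mt_state = torch.load(mt_ckpt, map_location="cpu")
    st_state = st_model.state_dict()

    def copy_matching(src_state, prefix):
        # Only overwrite parameters whose name and shape both match.
        for name, tensor in src_state.items():
            if (name.startswith(prefix) and name in st_state
                    and st_state[name].shape == tensor.shape):
                st_state[name] = tensor

    copy_matching(asr_state, "encoder.")  # speech encoder from ASR
    copy_matching(mt_state, "decoder.")   # text decoder from MT
    st_model.load_state_dict(st_state)
    return st_model
```

Copying by name-and-shape match, rather than loading whole checkpoints, keeps the sketch robust when the ST model only partially overlaps with the pre-trained networks.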
“…The end-to-end model has advantages over the cascaded pipeline; however, its training requires a moderate amount of paired speech-to-text data, which is not easy to acquire. Therefore, some techniques such as multi-task learning [13, 15-17], pre-training different components of the model [18-20], and generating synthetic data [21] have recently been proposed to mitigate the lack of ST parallel training data. These methods aim to use weakly supervised data, i.e.…”
Section: Introduction
confidence: 99%
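As a companion to the multi-task option this excerpt mentions, the sketch below shows one common way to combine a translation loss with an auxiliary transcription loss over a shared speech encoder. The class, the decoder interfaces, and the 0.3 weight are illustrative assumptions, not the cited works' setup.

```python
# Sketch: multi-task training for ST, where a shared speech encoder is
# trained with both a translation (ST) loss and an auxiliary transcription
# (ASR) loss. Interfaces and the loss weight are assumptions.
import torch.nn as nn

class MultiTaskST(nn.Module):
    def __init__(self, encoder, st_decoder, asr_decoder):
        super().__init__()
        self.encoder = encoder          # shared speech encoder
        self.st_decoder = st_decoder    # scores target-language text
        self.asr_decoder = asr_decoder  # scores source-language transcripts

    def forward(self, speech, translation, transcript, asr_weight=0.3):
        # Each decoder is assumed to return its own cross-entropy loss.
        states = self.encoder(speech)
        st_loss = self.st_decoder(states, translation)
        asr_loss = self.asr_decoder(states, transcript)
        # The auxiliary ASR term lets cheaper transcribed (but untranslated)
        # speech regularize the shared encoder.
        return st_loss + asr_weight * asr_loss
```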
“…Examples of source transcripts and original translations with their fluent counterparts are shown below in Table 1.

SRC: eh, eh, eh, um, yo pienso que es así
ORG: uh, uh, uh, um, i think it's like that
FLT: i think it's like that

SRC: también tengo um eh estoy tomando una clase ..
ORG: i also have um eh i'm taking a marketing class ..
FLT: i'm also taking a marketing class

SRC: porque qué va, mja ya te acuerda que ..
ORG: because what is, mhm do you recall now that ..
FLT: do you recall now that ..

SRC: y entonces am es entonces la universidad donde yo estoy es university of pennsylvania
ORG: and so am and so the university where i am it's the university of pennsylvania
FLT: i am at the university of pennsylvania

3 Speech-to-Text Model

Initial work on the Fisher-Spanish dataset used HMM-GMM ASR models linked with phrase-based MT using lattices (Post et al., 2013; Kumar et al., 2014). More recently, it was shown in Weiss et al. (2017) and Bansal et al. (2018) that end-to-end SLT models perform competitively on this task. As in Bansal et al. (2018), we use a sequence-to-sequence architecture inspired by Weiss et al. but modified to train within available resources; specifically, all models may be trained in less than 5 days on one GPU.…”
Section: Data
confidence: 99%
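The FLT rows in the excerpt's Table 1 drop fillers and false starts from the original translations. The snippet below is only an illustrative approximation of that effect, assuming a hand-picked filler list; the fluent references in the cited work were not necessarily produced by a rule this simple.

```python
# Purely illustrative approximation of the ORG -> FLT step in Table 1:
# drop common filler tokens from the original translations. The filler
# list is an assumption.
FILLERS = {"uh", "um", "eh", "mhm"}

def make_fluent(translation: str) -> str:
    tokens = [t for t in translation.split() if t.strip(",.") not in FILLERS]
    return " ".join(tokens)

print(make_fluent("uh, uh, uh, um, i think it's like that"))
# -> i think it's like that
```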
“…Low-resource automatic speech-to-text translation (AST) has recently gained traction as a way to bring NLP tools to under-represented languages. An end-to-end approach [1-7] is particularly appealing for source languages with no written form, or for endangered languages where translations into a high-resource language may be easier to collect than transcriptions [8]. However, building high-quality end-to-end AST with little parallel data is challenging, and has led researchers to explore how other sources of data could be used to help.…”
Section: Introduction
confidence: 99%