2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003774
A Comparative Study on End-to-End Speech to Text Translation

Abstract: Recent advances in deep learning show that the end-to-end speech-to-text translation model is a promising approach for direct speech translation. In this work, we provide an overview of different end-to-end architectures, as well as the usage of an auxiliary connectionist temporal classification (CTC) loss for better convergence. We also investigate pre-training variants, such as initializing different components of a model using pretrained models, and their impact on the final performance, which gives …
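As a rough illustration of the auxiliary CTC loss mentioned in the abstract, the following PyTorch sketch adds a CTC output layer on top of a speech encoder and combines its loss with the attention model's cross-entropy objective. Module names, dimensions, and the vocabulary size are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch: auxiliary CTC loss on top of a speech encoder (assumed setup, not the
# paper's exact architecture or hyperparameters).
import torch
import torch.nn as nn

class SpeechEncoderWithCTC(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=1000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.ctc_proj = nn.Linear(2 * hidden, vocab_size)  # CTC output layer

    def forward(self, feats):
        enc_out, _ = self.lstm(feats)        # (B, T, 2*hidden)
        ctc_logits = self.ctc_proj(enc_out)  # (B, T, vocab)
        return enc_out, ctc_logits

encoder = SpeechEncoderWithCTC()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 200, 80)             # batch of log-mel feature frames
targets = torch.randint(1, 1000, (4, 30))   # e.g. source-transcript label ids
in_lens = torch.full((4,), 200, dtype=torch.long)
tgt_lens = torch.full((4,), 30, dtype=torch.long)

enc_out, ctc_logits = encoder(feats)
log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)  # (T, B, vocab) for CTCLoss
aux_loss = ctc_loss(log_probs, targets, in_lens, tgt_lens)
# total_loss = attention_ce_loss + ctc_weight * aux_loss  # joint objective
```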


Cited by 72 publications (80 citation statements). References 24 publications.
“…After initialization with pre-trained components, we fine-tune on the DST training data. As proposed in (Bahar et al, 2019a), in order to familiarize the pre-trained text decoder with the output of the pre-trained speech encoder, we insert an additional adaptor layer which is a BiLSTM layer between the encoder and decoder. We train the adaptor component jointly without freezing the parameters in the fine-tuning stage.…”
Section: End-to-End Direct Speech Translation
confidence: 99%
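A minimal sketch of the adaptor idea described in the citation above: a BiLSTM layer inserted between a pre-trained speech encoder and a pre-trained text decoder, with all components kept trainable during fine-tuning. The encoder/decoder interfaces and dimensions are placeholder assumptions, not the cited papers' code.

```python
# Sketch: BiLSTM adaptor between pre-trained encoder and decoder (assumed
# interfaces; nothing is frozen during fine-tuning).
import torch.nn as nn

class AdaptedSTModel(nn.Module):
    def __init__(self, pretrained_encoder, pretrained_decoder,
                 enc_dim=512, dec_dim=512):
        super().__init__()
        self.encoder = pretrained_encoder        # initialized from ASR
        self.decoder = pretrained_decoder        # initialized from MT
        # Adaptor: maps encoder states into the space the text decoder expects.
        self.adaptor = nn.LSTM(enc_dim, dec_dim // 2, batch_first=True,
                               bidirectional=True)

    def forward(self, speech_feats, target_tokens):
        enc_out = self.encoder(speech_feats)         # (B, T, enc_dim)
        adapted, _ = self.adaptor(enc_out)           # (B, T, dec_dim)
        return self.decoder(target_tokens, adapted)  # attends over adapted states

# Fine-tuning on DST data: the optimizer simply receives model.parameters(),
# so encoder, adaptor, and decoder are all updated jointly.
```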
“…The ST encoder is initialized with the ASR encoder (except for the additional 3 layers that are initialized with random values). The decision of having a different number of encoder layers in the two encoders is motivated by the idea of introducing adaptation layers, which (Bahar et al, 2019a) reported to be essential when initializing the decoder with that of a pretrained MT model.…”
Section: Architectures
confidence: 99%
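A minimal sketch of the partial initialization described in this citation: parameters shared with the pre-trained ASR encoder are copied into the ST encoder, while the additional adaptation layers (absent from the ASR checkpoint) keep their random initialization. The checkpoint layout and key names are assumptions for illustration.

```python
# Sketch: initialize an ST encoder from an ASR checkpoint, leaving extra layers
# randomly initialized (assumed checkpoint format: a plain state dict).
import torch
import torch.nn as nn

def init_from_asr(st_encoder: nn.Module, asr_checkpoint: str) -> nn.Module:
    asr_state = torch.load(asr_checkpoint, map_location="cpu")
    st_state = st_encoder.state_dict()
    # Copy every parameter present in both models with a matching shape; the
    # additional (deeper) ST layers do not exist in the ASR checkpoint and
    # therefore retain their random initialization.
    shared = {k: v for k, v in asr_state.items()
              if k in st_state and v.shape == st_state[k].shape}
    st_state.update(shared)
    st_encoder.load_state_dict(st_state)
    return st_encoder
```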