2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003774
A Comparative Study on End-to-End Speech to Text Translation

Abstract: Recent advances in deep learning show that the end-to-end speech-to-text translation model is a promising approach for direct speech translation. In this work, we provide an overview of different end-to-end architectures, as well as the usage of an auxiliary connectionist temporal classification (CTC) loss for better convergence. We also investigate pre-training variants, such as initializing different components of a model using pretrained models, and their impact on the final performance, which gives …
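As a rough illustration of the auxiliary CTC loss mentioned in the abstract, the following PyTorch sketch adds a CTC output layer on top of a speech encoder and combines its loss with the attention model's cross-entropy objective. Module names, dimensions, and the vocabulary size are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch: auxiliary CTC loss on top of a speech encoder (assumed setup, not the
# paper's exact architecture or hyperparameters).
import torch
import torch.nn as nn

class SpeechEncoderWithCTC(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=1000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.ctc_proj = nn.Linear(2 * hidden, vocab_size)  # CTC output layer

    def forward(self, feats):
        enc_out, _ = self.lstm(feats)        # (B, T, 2*hidden)
        ctc_logits = self.ctc_proj(enc_out)  # (B, T, vocab)
        return enc_out, ctc_logits

encoder = SpeechEncoderWithCTC()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 200, 80)             # batch of log-mel feature frames
targets = torch.randint(1, 1000, (4, 30))   # e.g. source-transcript label ids
in_lens = torch.full((4,), 200, dtype=torch.long)
tgt_lens = torch.full((4,), 30, dtype=torch.long)

enc_out, ctc_logits = encoder(feats)
log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)  # (T, B, vocab) for CTCLoss
aux_loss = ctc_loss(log_probs, targets, in_lens, tgt_lens)
# total_loss = attention_ce_loss + ctc_weight * aux_loss  # joint objective
```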


Cited by 72 publications (80 citation statements). References 24 publications.
“…After initialization with pre-trained components, we fine-tune on the DST training data. As proposed in (Bahar et al, 2019a), in order to familiarize the pre-trained text decoder with the output of the pre-trained speech encoder, we insert an additional adaptor layer which is a BiLSTM layer between the encoder and decoder. We train the adaptor component jointly without freezing the parameters in the fine-tuning stage.…”
Section: End-to-End Direct Speech Translation
confidence: 99%
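A minimal sketch of the adaptor idea described in the citation above: a BiLSTM layer inserted between a pre-trained speech encoder and a pre-trained text decoder, with all components kept trainable during fine-tuning. The encoder/decoder interfaces and dimensions are placeholder assumptions, not the cited papers' code.

```python
# Sketch: BiLSTM adaptor between pre-trained encoder and decoder (assumed
# interfaces; nothing is frozen during fine-tuning).
import torch.nn as nn

class AdaptedSTModel(nn.Module):
    def __init__(self, pretrained_encoder, pretrained_decoder,
                 enc_dim=512, dec_dim=512):
        super().__init__()
        self.encoder = pretrained_encoder        # initialized from ASR
        self.decoder = pretrained_decoder        # initialized from MT
        # Adaptor: maps encoder states into the space the text decoder expects.
        self.adaptor = nn.LSTM(enc_dim, dec_dim // 2, batch_first=True,
                               bidirectional=True)

    def forward(self, speech_feats, target_tokens):
        enc_out = self.encoder(speech_feats)         # (B, T, enc_dim)
        adapted, _ = self.adaptor(enc_out)           # (B, T, dec_dim)
        return self.decoder(target_tokens, adapted)  # attends over adapted states

# Fine-tuning on DST data: the optimizer simply receives model.parameters(),
# so encoder, adaptor, and decoder are all updated jointly.
```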
“…The ST encoder is initialized with the ASR encoder (except for the additional 3 layers that are initialized with random values). The decision of having a different number of encoder layers in the two encoders is motivated by the idea of introducing adaptation layers, which (Bahar et al, 2019a) reported to be essential when initializing the decoder with that of a pretrained MT model.…”
Section: Architectures
confidence: 99%
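A minimal sketch of the partial initialization described in this citation: parameters shared with the pre-trained ASR encoder are copied into the ST encoder, while the additional adaptation layers (absent from the ASR checkpoint) keep their random initialization. The checkpoint layout and key names are assumptions for illustration.

```python
# Sketch: initialize an ST encoder from an ASR checkpoint, leaving extra layers
# randomly initialized (assumed checkpoint format: a plain state dict).
import torch
import torch.nn as nn

def init_from_asr(st_encoder: nn.Module, asr_checkpoint: str) -> nn.Module:
    asr_state = torch.load(asr_checkpoint, map_location="cpu")
    st_state = st_encoder.state_dict()
    # Copy every parameter present in both models with a matching shape; the
    # additional (deeper) ST layers do not exist in the ASR checkpoint and
    # therefore retain their random initialization.
    shared = {k: v for k, v in asr_state.items()
              if k in st_state and v.shape == st_state[k].shape}
    st_state.update(shared)
    st_encoder.load_state_dict(st_state)
    return st_encoder
```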