2021 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt48900.2021.9383462

Tight Integrated End-to-End Training for Cascaded Speech Translation

Abstract: A cascaded speech translation model relies on a discrete and non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation between source speech and target text. Such modeling suffers from error propagation between the ASR and MT models. Direct speech translation is an alternative method to avoid error propagation; however, its performance often lags behind that of the cascade system. To use an intermediate representation and preserve the end-to-end trainability, previ…
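The modeling contrast the abstract describes can be summarized in two decision rules. The following is a minimal sketch in our own notation (x for the source speech, f for the transcript, e for the target text), assuming the renormalized-posterior reading given in the citation statements below; it is not the paper's exact formulation.

```latex
% Cascade: a hard, non-differentiable argmax over transcripts sits
% between ASR and MT, so ASR errors propagate into MT and no joint
% gradient signal can correct them.
\hat{f} = \operatorname*{argmax}_{f} \, p_{\mathrm{ASR}}(f \mid x),
\qquad
\hat{e} = \operatorname*{argmax}_{e} \, p_{\mathrm{MT}}(e \mid \hat{f})

% Tight integration: the hard transcript is replaced by the renormalized
% ASR posterior distribution \tilde{p}_{\mathrm{ASR}}, passed on to MT as
% soft input, keeping the whole pipeline differentiable end to end.
\hat{e} = \operatorname*{argmax}_{e} \,
  p_{\mathrm{MT}}\bigl(e \mid \tilde{p}_{\mathrm{ASR}}(\cdot \mid x)\bigr)
```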

Cited by 25 publications (20 citation statements) | References 34 publications (26 reference statements)
“…Speech translation (ST) has recently attracted intensive attention from the AI community. Earlier works are mostly based on cascaded models, which perform NMT on the outputs of ASR systems (Ney, 1999; Mathias and Byrne, 2006; Sperber et al., 2017; Bahar et al., 2021). Cascaded models inevitably introduce error propagation from ASR (Weiss et al., 2017).…”
Section: Related Work (mentioning)
confidence: 99%
“…ASR exploits an attention-based model (Bahdanau et al., 2015; Vaswani et al., 2017) trained following Zeyer et al. (2018), while the MT component is based on the big Transformer model. The renormalized ASR posteriors are passed on into the MT model, and the system is trained in an end-to-end fashion (inspired by the posterior tight integrated model of Bahar et al., 2021a) using all available ASR, MT, and ST training data. The system uses an improved automatic segmentation based on voice activity detection (VAD) and endpoint detection (EP).…”
Section: Submissions (mentioning)
confidence: 99%
“…The posterior model is inspired by Bahar et al. (2021), where the cascade components, i.e. the end-to-end ASR and MT models, are collapsed into a single end-to-end trainable model.…”
Section: Posterior Tight Integration (mentioning)
confidence: 99%
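Concretely, the posterior tight integration quoted above can be pictured as a soft embedding lookup: the MT source representation is computed as an expectation over the (sharpened, renormalized) ASR posteriors rather than by a hard token lookup. The PyTorch sketch below illustrates that reading; the class name SoftSourceEmbedding, the gamma exponent, and all dimensions are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SoftSourceEmbedding(nn.Module):
    """Soft MT source embedding fed by ASR posteriors (illustrative sketch).

    Instead of embedding a discrete 1-best transcript, the full ASR
    posterior over the source vocabulary is sharpened, renormalized, and
    used as mixture weights over the MT embedding matrix, so the cascade
    stays differentiable end to end.
    """

    def __init__(self, vocab_size: int, embed_dim: int, gamma: float = 1.0):
        super().__init__()
        # Assumed to be shared with (or tied to) the MT encoder's embeddings.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Sharpening exponent: gamma = 1 keeps the posterior as-is;
        # large gamma approaches the hard 1-best cascade input.
        self.gamma = gamma

    def forward(self, asr_posteriors: torch.Tensor) -> torch.Tensor:
        # asr_posteriors: (batch, src_len, vocab_size), rows summing to 1.
        sharpened = asr_posteriors.clamp_min(1e-9).pow(self.gamma)
        renormalized = sharpened / sharpened.sum(dim=-1, keepdim=True)
        # Expected embedding under the renormalized posterior:
        # (batch, src_len, vocab) @ (vocab, dim) -> (batch, src_len, dim).
        return renormalized @ self.embedding.weight

# Usage: the output replaces ordinary token embeddings at the MT encoder
# input, so gradients from the translation loss flow back into the ASR model.
soft_embed = SoftSourceEmbedding(vocab_size=10000, embed_dim=512, gamma=2.0)
posteriors = torch.softmax(torch.randn(4, 20, 10000), dim=-1)  # stand-in ASR output
src_repr = soft_embed(posteriors)  # (4, 20, 512)
```

The gamma exponent interpolates between feeding the full posterior (gamma = 1) and approaching the discrete 1-best input of a plain cascade (large gamma), which is what lets the collapsed model remain close to cascade behavior while being end-to-end trainable.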
“…For the offline end-to-end translation task, we train deep Transformer models that benefit from pretraining, data augmentation in the form of synthetic data and SpecAugment, as well as domain adaptation on TED talks. Motivated by Bahar et al. (2021), we also collapse the ASR and MT components into a posterior model which passes on the ASR posteriors as input to the MT model. This system is not considered a direct model since it is closer to the cascade system while being end-to-end trainable.…”
Section: Introduction (mentioning)
confidence: 99%