2021 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt48900.2021.9383462

Tight Integrated End-to-End Training for Cascaded Speech Translation

Abstract: A cascaded speech translation model relies on a discrete and non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation between source speech and target text. Such modeling suffers from error propagation between the ASR and MT models. Direct speech translation is an alternative method to avoid error propagation; however, its performance often lags behind that of the cascade system. To use an intermediate representation and preserve the end-to-end trainability, previ…
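The modeling contrast the abstract describes can be summarized in two decision rules. The following is a minimal sketch in our own notation (x for the source speech, f for the transcript, e for the target text), assuming the renormalized-posterior reading given in the citation statements below; it is not the paper's exact formulation.

```latex
% Cascade: a hard, non-differentiable argmax over transcripts sits
% between ASR and MT, so ASR errors propagate into MT and no joint
% gradient signal can correct them.
\hat{f} = \operatorname*{argmax}_{f} \, p_{\mathrm{ASR}}(f \mid x),
\qquad
\hat{e} = \operatorname*{argmax}_{e} \, p_{\mathrm{MT}}(e \mid \hat{f})

% Tight integration: the hard transcript is replaced by the renormalized
% ASR posterior distribution \tilde{p}_{\mathrm{ASR}}, passed on to MT as
% soft input, keeping the whole pipeline differentiable end to end.
\hat{e} = \operatorname*{argmax}_{e} \,
  p_{\mathrm{MT}}\bigl(e \mid \tilde{p}_{\mathrm{ASR}}(\cdot \mid x)\bigr)
```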

Cited by 25 publications (20 citation statements) | References 34 publications (26 reference statements)
“…Speech translation (ST) has recently attracted intensive attention from the AI community. Earlier works are mostly based on cascaded models, which perform NMT on the outputs of ASR systems (Ney, 1999; Mathias and Byrne, 2006; Sperber et al., 2017; Bahar et al., 2021). Cascaded models inevitably introduce error propagation from ASR (Weiss et al., 2017).…”
Section: Related Work (mentioning)
confidence: 99%
“…ASR exploits an attention-based model (Bahdanau et al., 2015; Vaswani et al., 2017) trained following Zeyer et al. (2018), while the MT component is based on the big Transformer model. The renormalized ASR posteriors are passed on into the MT model, and the system is trained in an end-to-end fashion (inspired by the posterior tight integrated model of Bahar et al., 2021a) using all available ASR, MT, and ST training data. The system uses an improved automatic segmentation based on voice activity detection (VAD) and endpoint detection (EP).…”
Section: Submissions (mentioning)
confidence: 99%
“…The posterior model is inspired by Bahar et al. (2021), where the cascade components, i.e. the end-to-end ASR and MT models, are collapsed into a single end-to-end trainable model.…”
Section: Posterior Tight Integration (mentioning)
confidence: 99%
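Concretely, the posterior tight integration quoted above can be pictured as a soft embedding lookup: the MT source representation is computed as an expectation over the (sharpened, renormalized) ASR posteriors rather than by a hard token lookup. The PyTorch sketch below illustrates that reading; the class name SoftSourceEmbedding, the gamma exponent, and all dimensions are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SoftSourceEmbedding(nn.Module):
    """Soft MT source embedding fed by ASR posteriors (illustrative sketch).

    Instead of embedding a discrete 1-best transcript, the full ASR
    posterior over the source vocabulary is sharpened, renormalized, and
    used as mixture weights over the MT embedding matrix, so the cascade
    stays differentiable end to end.
    """

    def __init__(self, vocab_size: int, embed_dim: int, gamma: float = 1.0):
        super().__init__()
        # Assumed to be shared with (or tied to) the MT encoder's embeddings.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Sharpening exponent: gamma = 1 keeps the posterior as-is;
        # large gamma approaches the hard 1-best cascade input.
        self.gamma = gamma

    def forward(self, asr_posteriors: torch.Tensor) -> torch.Tensor:
        # asr_posteriors: (batch, src_len, vocab_size), rows summing to 1.
        sharpened = asr_posteriors.clamp_min(1e-9).pow(self.gamma)
        renormalized = sharpened / sharpened.sum(dim=-1, keepdim=True)
        # Expected embedding under the renormalized posterior:
        # (batch, src_len, vocab) @ (vocab, dim) -> (batch, src_len, dim).
        return renormalized @ self.embedding.weight

# Usage: the output replaces ordinary token embeddings at the MT encoder
# input, so gradients from the translation loss flow back into the ASR model.
soft_embed = SoftSourceEmbedding(vocab_size=10000, embed_dim=512, gamma=2.0)
posteriors = torch.softmax(torch.randn(4, 20, 10000), dim=-1)  # stand-in ASR output
src_repr = soft_embed(posteriors)  # (4, 20, 512)
```

The gamma exponent interpolates between feeding the full posterior (gamma = 1) and approaching the discrete 1-best input of a plain cascade (large gamma), which is what lets the collapsed model remain close to cascade behavior while being end-to-end trainable.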
“…For the offline end-to-end translation task, we train deep Transformer models that benefit from pretraining, data augmentation in the form of synthetic data and SpecAugment, as well as domain adaptation on TED talks. Motivated by Bahar et al. (2021), we also collapse the ASR and MT components into a posterior model which passes on the ASR posteriors as input to the MT model. This system is not considered a direct model since it is closer to the cascade system while being end-to-end trainable.…”
Section: Introduction (mentioning)
confidence: 99%