Interspeech 2020
DOI: 10.21437/interspeech.2020-2860

Contextualized Translation of Automatically Segmented Speech

Abstract: Direct speech-to-text translation (ST) models are usually trained on corpora segmented at sentence level, but at inference time they are commonly fed with audio split by a voice activity detector (VAD). Since VAD segmentation is not syntax-informed, the resulting segments do not necessarily correspond to well-formed sentences uttered by the speaker but, most likely, to fragments of one or more sentences. This segmentation mismatch considerably degrades the quality of ST models' output. So far, researchers have …
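The mismatch described in the abstract can be illustrated with a toy energy-based VAD that cuts the audio wherever the signal stays quiet for a while, regardless of where sentences actually end. A minimal sketch, assuming a 16 kHz mono waveform as a NumPy array; the frame size, threshold, and function name are illustrative choices, not the segmentation pipeline used in the paper:

```python
import numpy as np

def energy_vad_segments(wav, sr=16000, frame_ms=25, hop_ms=10,
                        energy_thresh=1e-4, min_silence_frames=30):
    """Toy energy-based VAD returning (start_sec, end_sec) speech segments.

    Cuts are placed wherever frame energy stays below `energy_thresh` for
    at least `min_silence_frames` consecutive frames, i.e. at pauses in
    the speech, not at sentence boundaries -- the source of the mismatch
    with sentence-segmented training data.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies = np.array([np.mean(wav[i:i + frame] ** 2)
                         for i in range(0, max(len(wav) - frame, 1), hop)])
    is_speech = energies > energy_thresh

    segments, start, silence_run = [], None, 0
    for t, speech in enumerate(is_speech):
        if speech:
            if start is None:
                start = t
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                segments.append((start * hop / sr, (t - silence_run) * hop / sr))
                start, silence_run = None, 0
    if start is not None:
        segments.append((start * hop / sr, len(wav) / sr))
    return segments
```

Segments produced this way can start or end mid-sentence, which is why models trained on sentence-level pairs degrade when fed such input.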

Cited by 15 publications (11 citation statements)
References 0 publications
“…Our study still relies on oracle sentence segmentation of the audio. The most related work to ours is (Gaido et al., 2020), which also investigated contextualized translation and showed that context-aware ST is less sensitive to audio segmentation errors. While they exclusively focus on the robustness to segmentation errors, our study investigates the benefits of context-aware E2E ST more broadly.…”
Section: Related Work
confidence: 99%
“…Both knowledge distillation and the first fine-tuning step (optimized by combining label-smoothed cross entropy and the CTC scoring function described in Gaido et al., 2020b) are carried out on manually segmented real and synthetic data. The second fine-tuning step is carried out on a random segmentation of the MuST-C v2 En-De dataset, aimed at making the system robust to automatically segmented test audio data (Gaido et al., 2020a). For the same purpose, a custom hybrid segmentation procedure is applied to the test data before passing them to the system.…”
Section: Submissions
confidence: 99%
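The combined objective mentioned in the statement above, label-smoothed cross entropy on the decoder output plus a CTC term on the encoder output, can be written as a weighted sum of two standard losses. A minimal PyTorch sketch, assuming the weight `ctc_weight` and the tensor layout below; the exact formulation used in Gaido et al. (2020b) may differ:

```python
import torch.nn.functional as F

def joint_ce_ctc_loss(decoder_logits, target_tokens,
                      ctc_log_probs, input_lengths, target_lengths,
                      pad_id=1, blank_id=0, label_smoothing=0.1, ctc_weight=0.5):
    """Weighted sum of label-smoothed CE (decoder) and CTC (encoder) losses.

    decoder_logits: (batch, tgt_len, vocab)  decoder outputs before softmax
    target_tokens:  (batch, tgt_len)         gold target token ids
    ctc_log_probs:  (src_len, batch, vocab)  log-softmax of the encoder CTC head
    """
    ce = F.cross_entropy(
        decoder_logits.transpose(1, 2),  # (batch, vocab, tgt_len)
        target_tokens,
        ignore_index=pad_id,
        label_smoothing=label_smoothing,
    )
    ctc = F.ctc_loss(
        ctc_log_probs, target_tokens,
        input_lengths, target_lengths,
        blank=blank_id, zero_infinity=True,
    )
    return (1.0 - ctc_weight) * ce + ctc_weight * ctc
```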
“…The models are trained using CTC attention loss, spectrogram augmentation, pretraining, and synthetic data obtained via forward translation, and are fine-tuned on the in-domain TED talks. Following Gaido et al. (2020a), the direct model is also fine-tuned on automatically segmented data to increase its robustness against sub-optimal, non-homogeneous utterances.…”
Section: Submissions
confidence: 99%
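The fine-tuning on automatically segmented data referenced in the last two statements can be approximated by re-splitting each training talk at random time points, so that the resulting audio/translation pairs no longer align with sentence boundaries. A rough sketch over a hypothetical utterance list; the data layout, the length bounds, and the midpoint-based target assignment are illustrative assumptions, and the actual procedure in Gaido et al. (2020a) may realign targets differently:

```python
import random

def random_resegment(utterances, min_len=3.0, max_len=20.0, seed=0):
    """Re-split one talk into random-length segments to mimic imperfect
    automatic segmentation.

    utterances: time-sorted list of dicts with 'start', 'end' (seconds)
                and 'translation', covering a single talk.
    Returns new segments, each carrying the translations of the original
    utterances whose temporal midpoint falls inside its span.
    """
    rng = random.Random(seed)
    talk_start, talk_end = utterances[0]["start"], utterances[-1]["end"]

    # Draw random cut points over the talk's timeline.
    cuts, t = [talk_start], talk_start
    while t < talk_end:
        t = min(t + rng.uniform(min_len, max_len), talk_end)
        cuts.append(t)

    segments = []
    for seg_start, seg_end in zip(cuts[:-1], cuts[1:]):
        text = " ".join(
            u["translation"] for u in utterances
            if seg_start <= (u["start"] + u["end"]) / 2 < seg_end
        )
        segments.append({"start": seg_start, "end": seg_end, "translation": text})
    return segments
```

Training on such segments alongside the original sentence-level pairs is one way to make a model less sensitive to the VAD-style splits seen at test time.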
“…How audio is segmented during inference significantly impacts ST performance (Gaido et al., 2020; Pham et al., 2020; Potapczyk and Przybysz, 2020; Gaido et al., 2021). This is because ST systems are usually trained on utterances segmented based on punctuation marks (Di Gangi et al., 2019), while the audio segmentation by voice activity detection (VAD) at test time does not have access to such meta-information.…”
Section: Segmentation
confidence: 99%