Interspeech 2020
DOI: 10.21437/interspeech.2020-2860

Contextualized Translation of Automatically Segmented Speech

Abstract: Direct speech-to-text translation (ST) models are usually trained on corpora segmented at sentence level, but at inference time they are commonly fed with audio split by a voice activity detector (VAD). Since VAD segmentation is not syntax-informed, the resulting segments do not necessarily correspond to well-formed sentences uttered by the speaker but, most likely, to fragments of one or more sentences. This segmentation mismatch considerably degrades the quality of ST models' output. So far, researchers have …
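The mismatch described in the abstract can be illustrated with a toy energy-based VAD that cuts the audio wherever the signal stays quiet for a while, regardless of where sentences actually end. A minimal sketch, assuming a 16 kHz mono waveform as a NumPy array; the frame size, threshold, and function name are illustrative choices, not the segmentation pipeline used in the paper:

```python
import numpy as np

def energy_vad_segments(wav, sr=16000, frame_ms=25, hop_ms=10,
                        energy_thresh=1e-4, min_silence_frames=30):
    """Toy energy-based VAD returning (start_sec, end_sec) speech segments.

    Cuts are placed wherever frame energy stays below `energy_thresh` for
    at least `min_silence_frames` consecutive frames, i.e. at pauses in
    the speech, not at sentence boundaries -- the source of the mismatch
    with sentence-segmented training data.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies = np.array([np.mean(wav[i:i + frame] ** 2)
                         for i in range(0, max(len(wav) - frame, 1), hop)])
    is_speech = energies > energy_thresh

    segments, start, silence_run = [], None, 0
    for t, speech in enumerate(is_speech):
        if speech:
            if start is None:
                start = t
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                segments.append((start * hop / sr, (t - silence_run) * hop / sr))
                start, silence_run = None, 0
    if start is not None:
        segments.append((start * hop / sr, len(wav) / sr))
    return segments
```

Segments produced this way can start or end mid-sentence, which is why models trained on sentence-level pairs degrade when fed such input.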

Cited by 15 publications (11 citation statements)
References 0 publications
“…Our study still relies on oracle sentence segmentation of the audio. The most related work to ours is (Gaido et al., 2020), which also investigated contextualized translation and showed that context-aware ST is less sensitive to audio segmentation errors. While they exclusively focus on the robustness to segmentation errors, our study investigates the benefits of context-aware E2E ST more broadly.…”
Section: Related Work
confidence: 99%
“…Both knowledge distillation and the first fine-tuning step (optimized by combining label-smoothed cross entropy and the CTC scoring function described in Gaido et al., 2020b) are carried out on manually segmented real and synthetic data. The second fine-tuning step is carried out on a random segmentation of the MuST-C v2 En-De dataset, aimed at making the system robust to automatically segmented test audio data (Gaido et al., 2020a). For the same purpose, a custom hybrid segmentation procedure is applied to the test data before passing them to the system.…”
Section: Submissions
confidence: 99%
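The combined objective mentioned in the statement above, label-smoothed cross entropy on the decoder output plus a CTC term on the encoder output, can be written as a weighted sum of two standard losses. A minimal PyTorch sketch, assuming the weight `ctc_weight` and the tensor layout below; the exact formulation used in Gaido et al. (2020b) may differ:

```python
import torch.nn.functional as F

def joint_ce_ctc_loss(decoder_logits, target_tokens,
                      ctc_log_probs, input_lengths, target_lengths,
                      pad_id=1, blank_id=0, label_smoothing=0.1, ctc_weight=0.5):
    """Weighted sum of label-smoothed CE (decoder) and CTC (encoder) losses.

    decoder_logits: (batch, tgt_len, vocab)  decoder outputs before softmax
    target_tokens:  (batch, tgt_len)         gold target token ids
    ctc_log_probs:  (src_len, batch, vocab)  log-softmax of the encoder CTC head
    """
    ce = F.cross_entropy(
        decoder_logits.transpose(1, 2),  # (batch, vocab, tgt_len)
        target_tokens,
        ignore_index=pad_id,
        label_smoothing=label_smoothing,
    )
    ctc = F.ctc_loss(
        ctc_log_probs, target_tokens,
        input_lengths, target_lengths,
        blank=blank_id, zero_infinity=True,
    )
    return (1.0 - ctc_weight) * ce + ctc_weight * ctc
```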
“…The models are trained using CTC attention loss, spectrogram augmentation, pretraining, and synthetic data obtained via forward translation, and are fine-tuned on the in-domain TED talks. Following Gaido et al. (2020a), the direct model is also fine-tuned on automatically segmented data to increase its robustness against sub-optimal, non-homogeneous utterances.…”
Section: Submissions
confidence: 99%
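The fine-tuning on automatically segmented data referenced in the last two statements can be approximated by re-splitting each training talk at random time points, so that the resulting audio/translation pairs no longer align with sentence boundaries. A rough sketch over a hypothetical utterance list; the data layout, the length bounds, and the midpoint-based target assignment are illustrative assumptions, and the actual procedure in Gaido et al. (2020a) may realign targets differently:

```python
import random

def random_resegment(utterances, min_len=3.0, max_len=20.0, seed=0):
    """Re-split one talk into random-length segments to mimic imperfect
    automatic segmentation.

    utterances: time-sorted list of dicts with 'start', 'end' (seconds)
                and 'translation', covering a single talk.
    Returns new segments, each carrying the translations of the original
    utterances whose temporal midpoint falls inside its span.
    """
    rng = random.Random(seed)
    talk_start, talk_end = utterances[0]["start"], utterances[-1]["end"]

    # Draw random cut points over the talk's timeline.
    cuts, t = [talk_start], talk_start
    while t < talk_end:
        t = min(t + rng.uniform(min_len, max_len), talk_end)
        cuts.append(t)

    segments = []
    for seg_start, seg_end in zip(cuts[:-1], cuts[1:]):
        text = " ".join(
            u["translation"] for u in utterances
            if seg_start <= (u["start"] + u["end"]) / 2 < seg_end
        )
        segments.append({"start": seg_start, "end": seg_end, "translation": text})
    return segments
```

Training on such segments alongside the original sentence-level pairs is one way to make a model less sensitive to the VAD-style splits seen at test time.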
“…How audio is segmented during inference significantly impacts ST performance (Gaido et al., 2020; Pham et al., 2020; Potapczyk and Przybysz, 2020; Gaido et al., 2021). This is because ST systems are usually trained on utterances segmented based on punctuation marks (Di Gangi et al., 2019), while the audio segmentation by voice activity detection (VAD) at test time does not have access to such meta-information.…”
Section: Segmentation
confidence: 99%