Abstract: Document-level contextual information has shown benefits for text-based machine translation, but whether and how context helps end-to-end (E2E) speech translation (ST) is still under-studied. We fill this gap through extensive experiments using a simple concatenation-based context-aware ST model, paired with adaptive feature selection on speech encodings for computational efficiency. We investigate several decoding approaches, and introduce in-model ensemble decoding, which jointly performs document- and sentence-level decoding. …
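The concatenation-based context approach described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's exact setup: the separator token `<sep>` and the context window size are assumptions.

```python
SEP = "<sep>"  # assumed context-separator token, not necessarily the paper's

def build_contextual_source(sentences, context_size=1):
    """Prepend up to `context_size` preceding sentences to each sentence,
    joined by a separator, to form context-aware model inputs."""
    examples = []
    for i, sent in enumerate(sentences):
        context = sentences[max(0, i - context_size):i]
        examples.append(SEP.join(context + [sent]) if context else sent)
    return examples
```

In a real E2E ST system the concatenation would happen over speech encodings rather than text, which is where the adaptive feature selection mentioned above keeps the longer inputs computationally tractable.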
“…We focus on improving translation quality of conversations by speaker-turn and cross-talk detection, yet using the context information could also help. In addition, within each MT-MS segment, the inter-utterance context could have already been leveraged (Zhang et al., 2021). We leave analysis of the inter- and intra-segment context as future work.…”
Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end, multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combines automatic speech recognition, speech translation, and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions and against conventional ST systems show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training. (Work conducted during an internship at Amazon.)
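The serialized labeling format with speaker-turn tokens described above can be sketched roughly as follows. The token name `<turn>` and the utterance tuple layout are illustrative assumptions, not the paper's exact specification.

```python
TURN = "<turn>"  # assumed speaker-change token

def serialize(utterances):
    """utterances: list of (start_time, speaker_id, text) tuples from the
    merged single channel, possibly interleaved across speakers.
    Returns one serialized target string with a turn token inserted at
    each change of speaker, in temporal order."""
    tokens = []
    prev_speaker = None
    for _start, speaker, text in sorted(utterances, key=lambda u: u[0]):
        if prev_speaker is not None and speaker != prev_speaker:
            tokens.append(TURN)
        tokens.append(text)
        prev_speaker = speaker
    return " ".join(tokens)
```

Training on such serialized targets lets a single sequence-to-sequence model learn transcription or translation and turn detection jointly, rather than running a separate diarization stage.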
“…Context-aware ST models have been shown to be robust towards error-prone automatic segmentations of the test set at inference time (Zhang et al., 2021a). Our method bears similarities with Gaido et al. (2020b) and Papi et al. (2021) in that it re-segments the train set to create synthetic data.…”
End-to-end Speech Translation is hindered by a lack of available data resources. While most of them are based on documents, a sentence-level version is available, which, however, is single and static, potentially impeding the usefulness of the data. We propose a new data augmentation strategy, SegAugment, to address this issue by generating multiple alternative sentence-level versions of a dataset. Our method utilizes an Audio Segmentation system, which re-segments the speech of each document with different length constraints, after which we obtain the target text via alignment methods. Experiments demonstrate consistent gains across eight language pairs in MuST-C, with an average increase of 2.5 BLEU points, and up to 5 BLEU for low-resource scenarios in mTEDx. Furthermore, when combined with a strong system, SegAugment obtains state-of-the-art results in MuST-C. Finally, we show that the proposed method can also successfully augment sentence-level datasets, and that it enables Speech Translation models to close the gap between the manual and automatic segmentation at inference time.
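The core idea of re-segmenting with different length constraints can be illustrated with a toy greedy packer over word-level timings. This is a simplified sketch under assumed inputs (word-level alignments), not SegAugment's actual segmentation system.

```python
def resegment(words, max_dur):
    """words: list of (word, start_sec, end_sec) in time order.
    Greedily packs consecutive words into segments no longer than
    `max_dur` seconds. Calling this with several different `max_dur`
    values yields alternative segmentations of the same document."""
    segments, current, seg_start = [], [], None
    for word, start, end in words:
        if current and end - seg_start > max_dur:
            segments.append(" ".join(current))
            current, seg_start = [], None
        if seg_start is None:
            seg_start = start
        current.append(word)
    if current:
        segments.append(" ".join(current))
    return segments
```

Each alternative segmentation, once paired with aligned target text, becomes an additional sentence-level training set derived from the same documents.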
“…With regard to exploiting streaming history, or more generally sentence context, it is worth mentioning the significant amount of previous work in offline MT at sentence level (Tiedemann and Scherrer, 2017; Agrawal et al., 2018), document level (Scherrer et al., 2019; Ma et al., 2020a; Zheng et al., 2020b; Li et al., 2020; Maruf et al., 2021; Zhang et al., 2021), and in related areas such as language modelling (Dai et al., 2019) that has proved to lead to quality gains. Also, as reported in (Li et al., 2020), more robust ST systems can be trained by taking advantage of the context across sentence boundaries using a data augmentation strategy similar to the prefix training methods proposed in (Niehues et al., 2018; Ma et al., 2019).…”
Simultaneous Machine Translation is the task of incrementally translating an input sentence before it is fully available. Currently, simultaneous translation is carried out by translating each sentence independently of the previously translated text. More generally, Streaming MT can be understood as an extension of Simultaneous MT to the incremental translation of a continuous input text stream. In this work, a state-of-the-art simultaneous sentence-level MT system is extended to the streaming setup by leveraging the streaming history. Extensive empirical results are reported on IWSLT Translation Tasks, showing that leveraging the streaming history leads to significant quality gains. In particular, the proposed system proves to compare favorably to the best-performing systems.
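Leveraging streaming history can be sketched as conditioning each new sentence's translation on a sliding window of previous (source, target) pairs. The `translate(source, history)` interface below is a hypothetical stand-in for a sentence-level MT model, not the paper's actual system.

```python
def stream_translate(sentences, translate, history_size=2):
    """Translate a stream sentence by sentence, passing the last
    `history_size` (source, target) pairs as context to the model.
    `translate` is an assumed callable: (source_str, history_list) -> str."""
    history, outputs = [], []
    for src in sentences:
        tgt = translate(src, history[-history_size:])
        outputs.append(tgt)
        history.append((src, tgt))
    return outputs
```

The window keeps inference cost bounded while still exposing cross-sentence context that a purely sentence-independent simultaneous system would discard.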