Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.206

Direct Segmentation Models for Streaming Speech Translation

Abstract: The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for stream…
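The cascade pipeline the abstract describes (streaming ASR output, chunked by a segmenter, fed chunk-by-chunk into MT) can be sketched as follows. This is a minimal illustration under assumed interfaces, not the paper's implementation; `ToyASR`, `ToySegmenter`, and `ToyMT` are hypothetical stand-ins for the real components.

```python
# Minimal sketch of a cascade ST pipeline: ASR -> segmenter -> MT.
# All classes below are illustrative placeholders, not the paper's models.

class ToyASR:
    """Stand-in ASR: yields pre-transcribed tokens one at a time."""
    def __init__(self, tokens):
        self.tokens = tokens
    def transcribe(self):
        yield from self.tokens

class ToySegmenter:
    """Stand-in segmenter: splits whenever the buffer ends in a period."""
    def should_split(self, buffer):
        return buffer[-1].endswith(".")

class ToyMT:
    """Stand-in MT: 'translates' a chunk by upper-casing it."""
    def translate(self, chunk):
        return " ".join(chunk).upper()

def cascade_st(asr, segmenter, mt):
    """Consume a streaming ASR transcript, split it into chunks, translate each."""
    chunks, buffer = [], []
    for token in asr.transcribe():
        buffer.append(token)
        if segmenter.should_split(buffer):   # chunk-boundary decision
            chunks.append(mt.translate(buffer))
            buffer = []
    if buffer:                               # flush the final partial chunk
        chunks.append(mt.translate(buffer))
    return chunks

print(cascade_st(ToyASR(["hello", "world.", "bye."]), ToySegmenter(), ToyMT()))
```

The quality of `should_split` is exactly what the paper's segmentation models target: poor boundaries produce chunks that are not semantically self-contained, degrading MT output.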

Cited by 15 publications (14 citation statements)
References 34 publications (28 reference statements)
“…The proposed model in Section 2 assumes that at inference time the source stream has been segmented into sentences. To this purpose, we opt for the text-based DS model (Iranzo-Sánchez et al., 2020), a sliding-window segmenter that moves over the source stream taking a split decision at each token based on a local-context window that extends to both past and future tokens. This segmenter is streaming-ready and obtains superior translation quality when compared with other segmenters (Stolcke, 2002; Cho et al., 2017).…”
Section: Partial Bidirectional Encoder
confidence: 99%
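The sliding-window behaviour this citation describes (a split decision at each token, using a bounded window of past and future context) can be illustrated with a short sketch. The boundary rule here is a toy stand-in; in the DS model the decision comes from a learned classifier, and the names below are assumptions for illustration only.

```python
# Sketch of a sliding-window segmenter: at each token, score a local window
# of `past` tokens behind and `future` tokens ahead for a split decision.
# The default `is_boundary` is a toy rule, not the paper's learned model.

def segment_stream(tokens, past=2, future=2, is_boundary=None):
    """Yield chunks of `tokens`, deciding a split at each position from a
    bounded local-context window."""
    if is_boundary is None:
        # Toy stand-in classifier: split after sentence-final punctuation.
        is_boundary = lambda window, i: window[i].endswith((".", "?", "!"))
    chunk = []
    for i, tok in enumerate(tokens):
        chunk.append(tok)
        lo = max(0, i - past)
        hi = min(len(tokens), i + future + 1)
        window = tokens[lo:hi]          # local context: past + future tokens
        if is_boundary(window, i - lo):
            yield chunk
            chunk = []
    if chunk:
        yield chunk

print(list(segment_stream(["how", "are", "you", "?", "fine", "."])))
```

Note that looking `future` tokens ahead is what makes the window bidirectional, and in a true streaming setting that lookahead is paid for in latency, which is why the window must stay small.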
“…In this work, the simultaneous MT model based on a unidirectional encoder-decoder and training along multiple wait-k paths proposed by (Elbayad et al., 2020a) is evolved into a streaming-ready simultaneous MT model. To achieve this, model training is performed following a sentence-boundary sliding-window strategy over the parallel stream that exploits the idea of prefix training, while inference is carried out in a single forward pass on the source stream that is segmented by a Direct Segmentation (DS) model (Iranzo-Sánchez et al., 2020). In addition, a refinement of the unidirectional encoder-decoder that takes advantage of longer context for encoding the initial positions of the streaming MT process is proposed.…”
Section: Introduction
confidence: 99%
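The wait-k policy this citation builds on can be sketched as a simple READ/WRITE schedule: the decoder first reads k source tokens, then alternates writing one target token and reading one source token until the source is exhausted. This is an illustrative sketch of the general wait-k idea, not Elbayad et al.'s code; `wait_k_schedule` is a hypothetical helper name.

```python
# Sketch of the wait-k read/write policy for simultaneous MT:
# READ k source tokens first, then alternate WRITE/READ, writing freely
# once the whole source has been read.

def wait_k_schedule(src_len, tgt_len, k):
    """Return the list of READ/WRITE actions for a wait-k decoding path."""
    actions, read, written = [], 0, 0
    while written < tgt_len:
        if read < min(k + written, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

print(wait_k_schedule(src_len=4, tgt_len=4, k=2))
```

Training along multiple such paths (varying k) is what lets a single model serve different latency regimes at inference time.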
“…Traditional simultaneous speech-to-text translation (SST) mainly depends on the ASR segmentation and then performs NMT based on the streaming segmented chunks (Oda et al., 2014; Iranzo-Sánchez et al., 2020). There is little attention on end-to-end SST.…”
Section: Related Work
confidence: 99%
“…Context-aware ST extends the sentence-level ST towards streaming ST which allows models to access unlimited previous audio inputs. Instead of improving contextual modeling, many studies on streaming ST aim at developing better sentence/word segmentation policies to avoid segmentation errors that greatly hurt translation (Matusov et al., 2007; Rangarajan Sridhar et al., 2013; Iranzo-Sánchez et al., 2020; Arivazhagan et al., 2020b). Very recently, Ma et al. (2020b) proposed a memory augmented Transformer encoder for streaming ST, where the previous audio features are summarized into a growing continuous memory to improve the model's context awareness.…”
Section: Related Work
confidence: 99%