Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.206

Direct Segmentation Models for Streaming Speech Translation

Abstract: The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for stream…
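The cascade pipeline the abstract describes (streaming ASR output, chunked by a segmenter, fed chunk-by-chunk into MT) can be sketched as follows. This is a minimal illustration under assumed interfaces, not the paper's implementation; `ToyASR`, `ToySegmenter`, and `ToyMT` are hypothetical stand-ins for the real components.

```python
# Minimal sketch of a cascade ST pipeline: ASR -> segmenter -> MT.
# All classes below are illustrative placeholders, not the paper's models.

class ToyASR:
    """Stand-in ASR: yields pre-transcribed tokens one at a time."""
    def __init__(self, tokens):
        self.tokens = tokens
    def transcribe(self):
        yield from self.tokens

class ToySegmenter:
    """Stand-in segmenter: splits whenever the buffer ends in a period."""
    def should_split(self, buffer):
        return buffer[-1].endswith(".")

class ToyMT:
    """Stand-in MT: 'translates' a chunk by upper-casing it."""
    def translate(self, chunk):
        return " ".join(chunk).upper()

def cascade_st(asr, segmenter, mt):
    """Consume a streaming ASR transcript, split it into chunks, translate each."""
    chunks, buffer = [], []
    for token in asr.transcribe():
        buffer.append(token)
        if segmenter.should_split(buffer):   # chunk-boundary decision
            chunks.append(mt.translate(buffer))
            buffer = []
    if buffer:                               # flush the final partial chunk
        chunks.append(mt.translate(buffer))
    return chunks

print(cascade_st(ToyASR(["hello", "world.", "bye."]), ToySegmenter(), ToyMT()))
```

The quality of `should_split` is exactly what the paper's segmentation models target: poor boundaries produce chunks that are not semantically self-contained, degrading MT output.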

Cited by 15 publications (14 citation statements)
References 34 publications (28 reference statements)
“…The proposed model in Section 2 assumes that at inference time the source stream has been segmented into sentences. To this purpose, we opt for the text-based DS model (Iranzo-Sánchez et al., 2020), a sliding-window segmenter that moves over the source stream taking a split decision at each token based on a local-context window that extends to both past and future tokens. This segmenter is streaming-ready and obtains superior translation quality when compared with other segmenters (Stolcke, 2002; Cho et al., 2017).…”
Section: Partial Bidirectional Encoder
confidence: 99%
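The sliding-window behaviour this citation describes (a split decision at each token, using a bounded window of past and future context) can be illustrated with a short sketch. The boundary rule here is a toy stand-in; in the DS model the decision comes from a learned classifier, and the names below are assumptions for illustration only.

```python
# Sketch of a sliding-window segmenter: at each token, score a local window
# of `past` tokens behind and `future` tokens ahead for a split decision.
# The default `is_boundary` is a toy rule, not the paper's learned model.

def segment_stream(tokens, past=2, future=2, is_boundary=None):
    """Yield chunks of `tokens`, deciding a split at each position from a
    bounded local-context window."""
    if is_boundary is None:
        # Toy stand-in classifier: split after sentence-final punctuation.
        is_boundary = lambda window, i: window[i].endswith((".", "?", "!"))
    chunk = []
    for i, tok in enumerate(tokens):
        chunk.append(tok)
        lo = max(0, i - past)
        hi = min(len(tokens), i + future + 1)
        window = tokens[lo:hi]          # local context: past + future tokens
        if is_boundary(window, i - lo):
            yield chunk
            chunk = []
    if chunk:
        yield chunk

print(list(segment_stream(["how", "are", "you", "?", "fine", "."])))
```

Note that looking `future` tokens ahead is what makes the window bidirectional, and in a true streaming setting that lookahead is paid for in latency, which is why the window must stay small.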
“…In this work, the simultaneous MT model based on a unidirectional encoder-decoder and training along multiple wait-k paths proposed by (Elbayad et al., 2020a) is evolved into a streaming-ready simultaneous MT model. To achieve this, model training is performed following a sentence-boundary sliding-window strategy over the parallel stream that exploits the idea of prefix training, while inference is carried out in a single forward pass on the source stream that is segmented by a Direct Segmentation (DS) model (Iranzo-Sánchez et al., 2020). In addition, a refinement of the unidirectional encoder-decoder that takes advantage of longer context for encoding the initial positions of the streaming MT process is proposed.…”
Section: Introduction
confidence: 99%
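The wait-k policy this citation builds on can be sketched as a simple READ/WRITE schedule: the decoder first reads k source tokens, then alternates writing one target token and reading one source token until the source is exhausted. This is an illustrative sketch of the general wait-k idea, not Elbayad et al.'s code; `wait_k_schedule` is a hypothetical helper name.

```python
# Sketch of the wait-k read/write policy for simultaneous MT:
# READ k source tokens first, then alternate WRITE/READ, writing freely
# once the whole source has been read.

def wait_k_schedule(src_len, tgt_len, k):
    """Return the list of READ/WRITE actions for a wait-k decoding path."""
    actions, read, written = [], 0, 0
    while written < tgt_len:
        if read < min(k + written, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

print(wait_k_schedule(src_len=4, tgt_len=4, k=2))
```

Training along multiple such paths (varying k) is what lets a single model serve different latency regimes at inference time.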
“…Traditional simultaneous speech-to-text translation (SST) mainly depends on the ASR segmentation and then performs NMT based on the streaming segmented chunks (Oda et al., 2014; Iranzo-Sánchez et al., 2020). There is little attention on end-to-end SST.…”
Section: Related Work
confidence: 99%
“…Context-aware ST extends the sentence-level ST towards streaming ST which allows models to access unlimited previous audio inputs. Instead of improving contextual modeling, many studies on streaming ST aim at developing better sentence/word segmentation policies to avoid segmentation errors that greatly hurt translation (Matusov et al., 2007; Rangarajan Sridhar et al., 2013; Iranzo-Sánchez et al., 2020; Arivazhagan et al., 2020b). Very recently, Ma et al. (2020b) proposed a memory augmented Transformer encoder for streaming ST, where the previous audio features are summarized into a growing continuous memory to improve the model's context awareness.…”
Section: Related Work
confidence: 99%