2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461690

End-to-End Automatic Speech Translation of Audiobooks

Abstract: We investigate end-to-end speech-to-text translation on a corpus of audiobooks specifically augmented for this task. Previous works investigated the extreme case where source-language transcription is available neither during learning nor during decoding; we also study a midway case where source-language transcription is available at training time only. In this case, a single model is trained to decode source speech into target text in a single pass. Experimental results show that it is possible to train compact and …

Cited by 167 publications (270 citation statements)
References 10 publications
“…SKINAUGMENT improves BLEU by 3.3 points for En-Ro and 2.2 for En-Fr over the end-to-end baseline. Our score of 14.58 matches the reported En-Fr score of [13] with a cascade model (14.6), up to their reported significant figures.…”
Section: Results (supporting)
confidence: 87%
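The BLEU gains quoted above (e.g. +3.3 points for En-Ro) are on the standard 0-100 BLEU scale. As a rough illustration of what the metric measures, here is a minimal single-pair BLEU sketch in plain Python — whitespace tokenization, no smoothing; this is an assumption-laden simplification, not the sacrebleu-style corpus-level implementation the cited papers use for their reported scores:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Single-pair BLEU on a 0-100 scale (unsmoothed sketch)."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())  # clipped n-gram matches
        total = max(sum(hyp_ng.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return 100 * bp * math.exp(log_avg)
```

A perfect match scores 100.0; a "+3.3 BLEU" improvement therefore means 3.3 points on this scale, computed over a whole test corpus rather than per sentence.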
“…An AST dataset pairs source-language audio with a target-language translation. We experiment on two standard AST datasets: AST LibriSpeech [12] (English-French; we use the same setup as [13]) and MuST-C (English-Romanian; 432 hours) [14]. We also use AST LibriSpeech for low resource ASR.…”
Section: Datasets and Evaluation (mentioning)
confidence: 99%
“…For the MT training, we use the TED, OpenSubtitles2018, Europarl, ParaCrawl, CommonCrawl, News Commentary, and Rapid corpora resulting in 32M sentence pairs after filtering noisy samples. LibriSpeech En→Fr: Similar to [18], to increase the training data size, we add the original translation and the Google Translate reference provided in the dataset package. It results in 200h of speech corresponding to 94.5k segments for the ST task.…”
Section: Pre-training (mentioning)
confidence: 99%
“…The end-to-end model has advantages over the cascaded pipeline; however, its training requires a moderate amount of paired speech-to-text data, which is not easy to acquire. Therefore, techniques such as multitask learning [13, 15-17], pre-training different components of the model [18-20], and generating synthetic data [21] have recently been proposed to mitigate the lack of ST parallel training data. These methods aim to use weakly supervised data, i.e.…”
Section: Introduction (mentioning)
confidence: 99%