2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461690

End-to-End Automatic Speech Translation of Audiobooks

Abstract: We investigate end-to-end speech-to-text translation on a corpus of audiobooks specifically augmented for this task. Previous works investigated the extreme case where source-language transcription is available neither during learning nor during decoding; we also study a midway case where source-language transcription is available at training time only. In this case, a single model is trained to decode source speech into target text in a single pass. Experimental results show that it is possible to train compact and …

Cited by 167 publications (270 citation statements)
References 10 publications
“…SKINAUGMENT improves BLEU by 3.3 points for En-Ro and 2.2 for En-Fr over the end-to-end baseline. Our score of 14.58 matches the reported En-Fr score of [13] with a cascade model (14.6), up to their reported significant figures.…”
Section: Results (supporting)
confidence: 87%
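The BLEU gains quoted above (e.g. +3.3 points for En-Ro) are on the standard 0-100 BLEU scale. As a rough illustration of what the metric measures, here is a minimal single-pair BLEU sketch in plain Python — whitespace tokenization, no smoothing; this is an assumption-laden simplification, not the sacrebleu-style corpus-level implementation the cited papers use for their reported scores:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Single-pair BLEU on a 0-100 scale (unsmoothed sketch)."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())  # clipped n-gram matches
        total = max(sum(hyp_ng.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return 100 * bp * math.exp(log_avg)
```

A perfect match scores 100.0; a "+3.3 BLEU" improvement therefore means 3.3 points on this scale, computed over a whole test corpus rather than per sentence.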
“…An AST dataset pairs source-language audio with a target-language translation. We experiment on two standard AST datasets: AST LibriSpeech [12] (English-French; we use the same setup as [13]) and MuST-C (English-Romanian; 432 hours) [14]. We also use AST LibriSpeech for low resource ASR.…”
Section: Datasets and Evaluation (mentioning)
confidence: 99%
“…For the MT training, we use the TED, OpenSubtitles2018, Europarl, ParaCrawl, CommonCrawl, News Commentary, and Rapid corpora resulting in 32M sentence pairs after filtering noisy samples. LibriSpeech En→Fr: Similar to [18], to increase the training data size, we add the original translation and the Google Translate reference provided in the dataset package. It results in 200h of speech corresponding to 94.5k segments for the ST task.…”
Section: Pre-training (mentioning)
confidence: 99%
“…The end-to-end model has advantages over the cascaded pipeline; however, its training requires a moderate amount of paired speech-to-text data, which is not easy to acquire. Therefore, techniques such as multitask learning [13, 15-17], pre-training different components of the model [18-20], and generating synthetic data [21] have recently been proposed to mitigate the lack of ST parallel training data. These methods aim to use weakly supervised data, i.e.…”
Section: Introduction (mentioning)
confidence: 99%