ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053406

SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation

Abstract: We propose autoencoding speaker conversion for training data augmentation in automatic speech translation. This technique directly transforms an audio sequence, resulting in audio synthesized to resemble another speaker's voice. Our method compares favorably to SpecAugment on English-French and English-Romanian automatic speech translation (AST) tasks as well as on a low-resource English automatic speech recognition (ASR) task. Further, in ablations, we show the benefits of both quantity and diversity in augme…
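The augmentation idea in the abstract can be illustrated with a minimal sketch. This is not the paper's autoencoder: `convert_speaker` is a hypothetical stand-in for any voice-conversion model. The point is only the data-expansion step, in which each (audio, translation) pair is duplicated in several synthetic voices while the translation is kept fixed.

```python
def convert_speaker(audio, target_speaker):
    # Placeholder: a real system would synthesize `audio` in the target
    # speaker's voice. Here we just tag the waveform for illustration.
    return {"audio": audio, "speaker": target_speaker}

def augment(dataset, target_speakers):
    """Expand (audio, translation) pairs: keep each original utterance and
    add one converted copy per target voice, reusing the same translation."""
    augmented = list(dataset)
    for audio, translation in dataset:
        for spk in target_speakers:
            augmented.append((convert_speaker(audio, spk), translation))
    return augmented

data = [("utt1.wav", "bonjour le monde")]
bigger = augment(data, target_speakers=["spk_a", "spk_b"])  # 3 pairs total
```

SpecAugment, the baseline the abstract compares against, instead masks spans of the input spectrogram rather than synthesizing new voices.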

Cited by 12 publications (6 citation statements)
References 12 publications (21 reference statements)
“…Usually, low-resource ST tasks focus on feature enhancement and model optimization. On the one hand, techniques such as data augmentation [10, 13–15], multitask learning [16–18], and pretraining on ASR data [8, 19–21] are used to enhance the feature representation. On the other hand, knowledge refinement [22], self-training [23], or multilingual ST [24–28] have been used to address the actual scarcity in speech translation.…”
Section: Introduction
confidence: 99%
“…Although the cascaded model, by having access to all the pretrained parameters (the encoder and decoder of both NMT and ASR), still has better translation quality, we can bring the performance of an end-to-end model closer to it by adding the new regularizer. It is also important to note that since we are not changing the final structure of the AST model, most of the other techniques for further improving the translation quality, such as data augmentation, which was examined in previous studies (McCarthy et al., 2020; Park et al., 2019), can also be applied, but we do not study them in this paper.…”
Section: Using Both AST and External Data
confidence: 99%
“…After the success of (Weiss et al., 2017) in creating a powerful model for ST systems, more recent studies focused on exploring their power, and one of the main approaches to boost the performance of such models is to make use of available data from other tasks, such as ASR and NMT. (Weiss et al., 2017; Anastasopoulos and Chiang, 2018; Sperber et al., 2019) show that multitask learning can be effective, and (Jia et al., 2019; Pino et al., 2019; Park et al., 2019; McCarthy et al., 2020) investigate various data augmentation techniques. The impact of pretraining the encoder with an ASR model is also studied in (Berard et al., 2018; Bansal et al., 2018, 2019).…”
Section: Related Work
confidence: 99%
“…On CoVoST, we use a character-level vocabulary of 54 characters, including the English alphabet, numerical characters, punctuation, and the markers for the fairseq [26] dictionary. On MuST-C, we choose a unigram vocabulary of size 10000, as in [27], to better balance the training time, since the sentences in MuST-C are generally longer. The vocabulary is obtained using SentencePiece [28].…”
Section: Preprocessing
confidence: 99%
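The character-level setup described in the excerpt above can be sketched in plain Python. This is an illustration, not fairseq's actual `Dictionary` implementation: the special markers and their ordering are assumptions, and rather than reproducing the 54-symbol inventory, the vocabulary is simply built from whatever characters the corpus contains.

```python
def build_char_vocab(corpus):
    """Map each character seen in the corpus to an integer id, reserving
    the first ids for fairseq-style special symbols (assumed ordering)."""
    specials = ["<s>", "<pad>", "</s>", "<unk>"]
    vocab = {sym: i for i, sym in enumerate(specials)}
    for ch in sorted(set("".join(corpus))):
        vocab[ch] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn a string into a list of ids, falling back to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text]

vocab = build_char_vocab(["hello world", "it's 10 a.m."])
ids = encode("hello", vocab)  # one id per character
```

For the MuST-C side, the excerpt instead trains a subword model with SentencePiece (unigram, vocabulary size 10000), which merges frequent character sequences into single tokens and so shortens the longer MuST-C sentences.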