ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053008

Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

Abstract: Recent advances in text-to-speech (TTS) led to the development of flexible multi-speaker end-to-end TTS systems. We extend state-of-the-art attention-based automatic speech recognition (ASR) systems with synthetic audio generated by a TTS system trained only on the ASR corpora itself. ASR and TTS systems are built separately to show that text-only data can be used to enhance existing end-to-end ASR systems without the necessity of parameter or architecture changes. We compare our method with language model int…
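The pipeline the abstract describes can be outlined in a few lines. The following Python sketch illustrates it under stated assumptions: `synthesize`, `build_synthetic_corpus`, and the placeholder data are hypothetical names for illustration, not the authors' code.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the multi-speaker TTS system; per the abstract,
# the TTS model is trained only on the ASR corpus itself.
def synthesize(text: str) -> List[float]:
    return [0.0] * len(text)  # placeholder waveform

def build_synthetic_corpus(
    texts: List[str], tts: Callable[[str], List[float]]
) -> List[Tuple[List[float], str]]:
    """Pair each text-only utterance with TTS audio, yielding additional
    (audio, transcript) examples for ASR training."""
    return [(tts(t), t) for t in texts]

# Real and synthetic pairs are simply concatenated: the ASR architecture
# and its parameters stay unchanged, which is the abstract's central claim.
real_pairs: List[Tuple[List[float], str]] = []  # existing ASR corpus
extra_texts = ["a text-only sentence", "another text-only sentence"]
training_data = real_pairs + build_synthetic_corpus(extra_texts, synthesize)
```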


Cited by 59 publications (51 citation statements). References 21 publications (32 reference statements).
“…An advantage of using supervised learning is that the method can be trained to handle such difficulties. In our experiments, training with either synthetic [32], [39] or real speech yields similar performance. However, for music, training with synthetic data is not effective.…”
Section: B. Insights (mentioning)
confidence: 69%
“…Instead of directly combining a portion of the original real data with synthetic data, as in previous studies [9,11,19], we propose to sample data on-the-fly from the source domain real data and the target domain synthetic data. The sampling distribution is a global and configurable hyperparameter that propagates into each training batch.…”
Section: Sampled Data Combination (mentioning)
confidence: 99%
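As a rough illustration of the on-the-fly sampling described in the statement above, the sketch below draws each batch element from the real or synthetic pool according to a single global ratio. The names `sample_batches`, `p_synthetic`, and the `Example` alias are hypothetical, not the cited paper's implementation.

```python
import random
from typing import Iterator, List, Tuple

Example = Tuple[List[float], str]  # (audio, transcript)

def sample_batches(
    real: List[Example],
    synthetic: List[Example],
    p_synthetic: float,
    batch_size: int,
) -> Iterator[List[Example]]:
    """Draw each example from the synthetic pool with probability
    `p_synthetic` and from the real pool otherwise, so the globally
    configured mixing ratio propagates into every training batch."""
    while True:
        yield [
            random.choice(synthetic) if random.random() < p_synthetic
            else random.choice(real)
            for _ in range(batch_size)
        ]
```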
“…We refer to it as SYNTH SPEECH, and its statistics can be found in Table 1. Our text-to-speech (TTS) model is trained on the LibriSpeech ASR dataset as described in (Rossenbach et al., 2020). Using the TTS model, we synthesize 800k random samples (5M words in total, as listed in Table 1) from the OpenSubtitles corpus, pre-filtered as described in Section 2.3.…”
Section: End-to-End Direct Speech Translation (mentioning)
confidence: 99%
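The SYNTH SPEECH construction step quoted above amounts to sampling pre-filtered text and running it through the TTS model. A minimal sketch under stated assumptions follows; `make_synth_speech`, `filtered_lines`, and the `tts` callable are hypothetical names, and the real pipeline would invoke the actual LibriSpeech-trained neural TTS model rather than a placeholder.

```python
import random
from typing import Callable, List, Tuple

def make_synth_speech(
    filtered_lines: List[str],
    tts: Callable[[str], List[float]],
    n_samples: int = 800_000,
    seed: int = 0,
) -> List[Tuple[List[float], str]]:
    """Draw random sentences from the pre-filtered OpenSubtitles text and
    synthesize audio for each, yielding (audio, transcript) pairs."""
    rng = random.Random(seed)
    texts = rng.sample(filtered_lines, n_samples)
    return [(tts(t), t) for t in texts]
```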