ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053008

Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

Abstract: Recent advances in text-to-speech (TTS) led to the development of flexible multi-speaker end-to-end TTS systems. We extend state-of-the-art attention-based automatic speech recognition (ASR) systems with synthetic audio generated by a TTS system trained only on the ASR corpora itself. ASR and TTS systems are built separately to show that text-only data can be used to enhance existing end-to-end ASR systems without the necessity of parameter or architecture changes. We compare our method with language model int…
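The pipeline the abstract describes can be outlined in a few lines. The following Python sketch illustrates it under stated assumptions: `synthesize`, `build_synthetic_corpus`, and the placeholder data are hypothetical names for illustration, not the authors' code.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the multi-speaker TTS system; per the abstract,
# the TTS model is trained only on the ASR corpus itself.
def synthesize(text: str) -> List[float]:
    return [0.0] * len(text)  # placeholder waveform

def build_synthetic_corpus(
    texts: List[str], tts: Callable[[str], List[float]]
) -> List[Tuple[List[float], str]]:
    """Pair each text-only utterance with TTS audio, yielding additional
    (audio, transcript) examples for ASR training."""
    return [(tts(t), t) for t in texts]

# Real and synthetic pairs are simply concatenated: the ASR architecture
# and its parameters stay unchanged, which is the abstract's central claim.
real_pairs: List[Tuple[List[float], str]] = []  # existing ASR corpus
extra_texts = ["a text-only sentence", "another text-only sentence"]
training_data = real_pairs + build_synthetic_corpus(extra_texts, synthesize)
```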


Cited by 59 publications (51 citation statements). References 21 publications (32 reference statements).
“…An advantage of using supervised learning is that the method can be trained to handle such difficulties. In our experiments, training with either synthetic [32], [39] or real speech yields similar performance. However, for music, training with synthetic data is not effective.…”
Section: B. Insights (mentioning)
confidence: 69%
“…Instead of directly combining a portion of the original real data with synthetic data, as in previous studies [9,11,19], we propose to sample data on-the-fly from the source domain real data and the target domain synthetic data. The sampling distribution is a global and configurable hyperparameter that propagates into each training batch.…”
Section: Sampled Data Combination (mentioning)
confidence: 99%
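As a rough illustration of the on-the-fly sampling described in the statement above, the sketch below draws each batch element from the real or synthetic pool according to a single global ratio. The names `sample_batches`, `p_synthetic`, and the `Example` alias are hypothetical, not the cited paper's implementation.

```python
import random
from typing import Iterator, List, Tuple

Example = Tuple[List[float], str]  # (audio, transcript)

def sample_batches(
    real: List[Example],
    synthetic: List[Example],
    p_synthetic: float,
    batch_size: int,
) -> Iterator[List[Example]]:
    """Draw each example from the synthetic pool with probability
    `p_synthetic` and from the real pool otherwise, so the globally
    configured mixing ratio propagates into every training batch."""
    while True:
        yield [
            random.choice(synthetic) if random.random() < p_synthetic
            else random.choice(real)
            for _ in range(batch_size)
        ]
```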
“…We refer to it as SYNTH SPEECH, and its statistics can be found in Table 1. Our text-to-speech (TTS) model is trained on the LibriSpeech ASR dataset as described in (Rossenbach et al., 2020). Using the TTS model, we synthesize 800k random samples (5M words in total, as listed in Table 1) from the OpenSubtitles corpus, pre-filtered as described in Section 2.3.…”
Section: End-to-End Direct Speech Translation (mentioning)
confidence: 99%
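The SYNTH SPEECH construction step quoted above amounts to sampling pre-filtered text and running it through the TTS model. A minimal sketch under stated assumptions follows; `make_synth_speech`, `filtered_lines`, and the `tts` callable are hypothetical names, and the real pipeline would invoke the actual LibriSpeech-trained neural TTS model rather than a placeholder.

```python
import random
from typing import Callable, List, Tuple

def make_synth_speech(
    filtered_lines: List[str],
    tts: Callable[[str], List[float]],
    n_samples: int = 800_000,
    seed: int = 0,
) -> List[Tuple[List[float], str]]:
    """Draw random sentences from the pre-filtered OpenSubtitles text and
    synthesize audio for each, yielding (audio, transcript) pairs."""
    rng = random.Random(seed)
    texts = rng.sample(filtered_lines, n_samples)
    return [(tts(t), t) for t in texts]
```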