2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639619
Back-Translation-Style Data Augmentation for End-to-End ASR

Abstract: In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR), utilizing a large amount of text which is not paired with speech signals. Inspired by the back-translation technique proposed in the field of machine translation, we build a neural text-to-encoder model which predicts a sequence of hidden states extracted by a pre-trained E2E-ASR encoder from a sequence of characters. By using hidden states as a target instead of acoustic features, i…
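The augmentation scheme in the abstract can be illustrated with a toy numerical sketch: fit a text-to-encoder (TTE) model to reproduce encoder hidden states from characters on paired data, then run it on text-only input to synthesize (hidden state, text) pairs for decoder training. This is a minimal linear stand-in with random targets, not the paper's actual attention-based TTE network; all sizes and names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; the real TTE model is a neural attention network).
vocab, emb_dim, hid_dim, seq_len = 20, 8, 16, 5

# Stand-in for hidden states produced by a pre-trained E2E-ASR encoder
# on one paired utterance; these are the TTE training targets.
target_hidden = rng.normal(size=(seq_len, hid_dim))
char_ids = np.array([3, 7, 1, 12, 5])  # character ids of the transcript

# Text-to-encoder model: character embedding followed by a projection.
E = rng.normal(scale=0.1, size=(vocab, emb_dim))
W = rng.normal(scale=0.1, size=(emb_dim, hid_dim))

def tte(ids):
    """Predict a sequence of encoder hidden states from character ids."""
    return E[ids] @ W  # shape (len(ids), hid_dim)

# Fit the TTE model to the encoder hidden states by gradient descent
# on the squared error between predicted and true hidden states.
lr = 0.1
for _ in range(500):
    err = tte(char_ids) - target_hidden        # (seq_len, hid_dim)
    gW = E[char_ids].T @ err / seq_len
    gE = np.zeros_like(E)
    gE[char_ids] = err @ W.T / seq_len         # ids are distinct here
    W -= lr * gW
    E -= lr * gE

final_loss = float(np.mean((tte(char_ids) - target_hidden) ** 2))

# Back-translation-style step: for text with no paired audio, synthesize
# hidden states and pair them with the text to train the ASR decoder.
unpaired_text = np.array([2, 9, 14])
synthetic_hidden = tte(unpaired_text)          # (3, hid_dim)
```

Targeting hidden states rather than raw acoustic features, as the abstract describes, keeps the synthesis problem lower-dimensional and matched to what the decoder actually consumes.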

Cited by 85 publications (64 citation statements) · References 33 publications
“…Furthermore, since our ESPnet-TTS is an extension of ESPnet, both ASR and TTS recipes are based on a unified design, which allows us to easily integrate ASR functions with TTS. For example, ASR-based objective evaluation for TTS systems and advanced research topics such as semi-supervised learning [28]–[31] can be realized by combining ASR and TTS modules in the unified framework.…”
Section: Related Work
confidence: 99%
“…To further increase the performance of end-to-end systems in low-resource conditions, untranscribed speech or text can be used as additional training data. A previously published approach is the text-to-encoder (TTE) model, which can integrate additional text [4] or untranscribed speech [5] into ASR training. Another method is the joint training of ASR and text-to-speech (TTS) systems, such as the Speech Chain approach [6]–[8] or variants of it [9].…”
Section: Introduction and Related Work
confidence: 99%
“…Comparing Against Semi-supervised Methods We also listed the performance obtained with the same setting reported by prior works (referred to as "semi-supervised") for comparison. Our word embedding regularization surpassed the back-translation data augmentation method [8] (row (d)) yet still performed worse than the adversarial training method [11] (row (e)). With fused decoding, we further narrowed the gap.…”
Section: Results on Low-Resource ASR
confidence: 99%
“…With fused decoding, we further narrowed the gap. However, it is worth mentioning that all the semi-supervised methods listed in Table 2 required ASR counterpart training (a text-to-speech model [10,8] or a discriminator [11]) to optimize the performance, at the price of higher computational resources. But our methods add nearly no cost 1 in training.…”
Section: Results on Low-Resource ASR
confidence: 99%