2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639619
Back-Translation-Style Data Augmentation for End-to-End ASR

Abstract: In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR), utilizing a large amount of text which is not paired with speech signals. Inspired by the back-translation technique proposed in the field of machine translation, we build a neural text-to-encoder model which predicts a sequence of hidden states extracted by a pre-trained E2E-ASR encoder from a sequence of characters. By using hidden states as a target instead of acoustic features, i…
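The augmentation scheme in the abstract can be illustrated with a toy numerical sketch: fit a text-to-encoder (TTE) model to reproduce encoder hidden states from characters on paired data, then run it on text-only input to synthesize (hidden state, text) pairs for decoder training. This is a minimal linear stand-in with random targets, not the paper's actual attention-based TTE network; all sizes and names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; the real TTE model is a neural attention network).
vocab, emb_dim, hid_dim, seq_len = 20, 8, 16, 5

# Stand-in for hidden states produced by a pre-trained E2E-ASR encoder
# on one paired utterance; these are the TTE training targets.
target_hidden = rng.normal(size=(seq_len, hid_dim))
char_ids = np.array([3, 7, 1, 12, 5])  # character ids of the transcript

# Text-to-encoder model: character embedding followed by a projection.
E = rng.normal(scale=0.1, size=(vocab, emb_dim))
W = rng.normal(scale=0.1, size=(emb_dim, hid_dim))

def tte(ids):
    """Predict a sequence of encoder hidden states from character ids."""
    return E[ids] @ W  # shape (len(ids), hid_dim)

# Fit the TTE model to the encoder hidden states by gradient descent
# on the squared error between predicted and true hidden states.
lr = 0.1
for _ in range(500):
    err = tte(char_ids) - target_hidden        # (seq_len, hid_dim)
    gW = E[char_ids].T @ err / seq_len
    gE = np.zeros_like(E)
    gE[char_ids] = err @ W.T / seq_len         # ids are distinct here
    W -= lr * gW
    E -= lr * gE

final_loss = float(np.mean((tte(char_ids) - target_hidden) ** 2))

# Back-translation-style step: for text with no paired audio, synthesize
# hidden states and pair them with the text to train the ASR decoder.
unpaired_text = np.array([2, 9, 14])
synthetic_hidden = tte(unpaired_text)          # (3, hid_dim)
```

Targeting hidden states rather than raw acoustic features, as the abstract describes, keeps the synthesis problem lower-dimensional and matched to what the decoder actually consumes.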

Cited by 85 publications (64 citation statements) · References 33 publications
“…Furthermore, since our ESPnet-TTS is an extension of ESPnet, both ASR and TTS recipes are based on a unified design, which allows us to easily integrate ASR functions with TTS. For example, ASR-based objective evaluation for TTS systems and advanced research topics such as semi-supervised learning [28]–[31] can be realized by combining ASR and TTS modules in the unified framework.…”
Section: Related Work
confidence: 99%
“…To further increase the performance of end-to-end systems in low-resource conditions, untranscribed speech or text can be used as additional training data. A previously published approach is the text-to-encoder (TTE) model, which can integrate additional text [4] or untranscribed speech [5] into ASR training. Another method is the joint training of ASR and text-to-speech (TTS) systems, such as the Speech Chain approach [6]–[8] or variants of it [9].…”
Section: Introduction and Related Work
confidence: 99%
“…Comparing Against Semi-supervised Methods We also listed the performance obtained with the same setting reported by prior works (referred to as "semi-supervised") for comparison. Our word embedding regularization surpassed the back-translation data augmentation method [8] (row (d)) yet still performed worse than the adversarial training method [11] (row (e)). With fused decoding, we further narrowed the gap.…”
Section: Results on Low-Resource ASR
confidence: 99%
“…With fused decoding, we further narrowed the gap. However, it is worth mentioning that all the semi-supervised methods listed in Table 2 required ASR counterpart training (a text-to-speech model [10,8] or a discriminator [11]) to optimize the performance, at the price of higher computational resources. But our methods add nearly no cost 1 in training.…”
Section: Results on Low-Resource ASR
confidence: 99%