A Deliberation-Based Joint Acoustic and Text Decoder

Mavandadi, Sepand; Sainath, Tara N.; Hu, Kevin; Wu, Zelin

doi:10.21437/interspeech.2021-165

Cited by 5 publications

(8 citation statements)

References 29 publications

“…Text-only data can also be converted to TTS utterances for training the whole deliberation decoder. We employ JATD training [11] and scale it up using text data sampled from multiple domains, i.e., 51M, 20M, 1.6M, 0.6M, and 11M text sentences from Maps, News, Play, Search and YouTube domains, respectively. In comparison, [11] uses only 4.6M samples from the Maps domain.…”

Section: Large Scale TTS Training (mentioning)

confidence: 99%

“…While LM relies on only text hypotheses for rescoring, deliberation models have been recently proposed for second-pass rescoring using both text hypotheses and audio [9,10]. Compared to LM training, there has been few attempts at incorporating widely available text-only or audio-only data in deliberation (see [11]). In this work, we research various ways to utilize large-scale text-only and semi-supervised data for deliberation training.…”

Section: Introduction (mentioning)

confidence: 99%

“…Instead of training external modules such as LMs, several recent studies incorporate text-only data into supervised training to jointly train E2E models [16,11,17,18,19,20]. For example, text-only data has been used to train speech encoders [16,18].…”

Section: Introduction (mentioning)

confidence: 99%

“…On the other hand, ASR decoders have also been modified for text-only training. [11] extends a joint acoustic and text decoder (JATD) from the Listen, Attend and Spell (LAS) [21] to a deliberation decoder, and uses text-only data (or synthesized utterances) to train the decoder with fixed context vectors. [17] modifies the transformer decoder to have only self-attention (except for the last layer) so they can be trained by text-only data.…”

Section: Introduction (mentioning)

confidence: 99%

“…Our results show that pretraining a conformer text encoder with large enough size significantly improves recognition for both Voice Search and long-tail words. In addition to the text encoder, we also synthesize large-scale text-only data (84M) to TTS utterances in training the deliberation decoder using JATD [11]. Third, since deliberation attends to encoded audio, we perform large-scale semi-supervised training using 500M unlabeled speech utterances from Google Voice Search domain and transcribed using a conventional model.…”

Section: Introduction (mentioning)

confidence: 99%

Improving Deliberation by Text-Only and Semi-Supervised Training

Hu, Sainath, He, et al. 2022

Preprint

Self Cite

Text-only and semi-supervised training based on audio-only data has gained popularity recently due to the wide availability of unlabeled text and speech data. In this work, we propose incorporating text-only and semi-supervised training into an attention-based deliberation model. By incorporating text-only data in training a bidirectional encoder representation from transformer (BERT) for the deliberation text encoder, and large-scale text-to-speech and audio-only utterances using joint acoustic and text decoder (JATD) and semi-supervised training, we achieved 4%-12% WER reduction for various tasks compared to the baseline deliberation. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces the Google Voice Search WER by 11% relative. We show that the deliberation model also achieves a positive human side-by-side evaluation compared to the state-of-the-art LM rescorer with reasonable endpointer latencies.
