“…Text-only data can also be converted to TTS utterances for training the whole deliberation decoder. We employ JATD training [11] and scale it up using text data sampled from multiple domains, i.e., 51M, 20M, 1.6M, 0.6M, and 11M text sentences from Maps, News, Play, Search and YouTube domains, respectively. In comparison, [11] uses only 4.6M samples from the Maps domain.…”
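To make the scale of the text-only data concrete, the quoted per-domain counts can be summed and compared with the 4.6M Maps sentences attributed to [11]. The sketch below is only arithmetic on the numbers quoted above. The names `DOMAIN_SENTENCES_M`, `BASELINE_MAPS_M`, `total_millions` and `domain_shares` are hypothetical. The per-domain shares describe the composition of the pool, and the excerpt does not say how domains are weighted or sampled during training.

```python
# Sentence counts in millions, as quoted in the excerpt above.
DOMAIN_SENTENCES_M = {
    "Maps": 51.0,
    "News": 20.0,
    "Play": 1.6,
    "Search": 0.6,
    "YouTube": 11.0,
}

# Size of the Maps-only text set attributed to [11], in millions.
BASELINE_MAPS_M = 4.6


def total_millions(counts):
    """Return the total number of sentences (in millions) across domains."""
    return sum(counts.values())


def domain_shares(counts):
    """Return each domain's fraction of the pooled text data.

    This is the composition of the pool only; treating it as a sampling
    weight would be an assumption not stated in the excerpt.
    """
    total = total_millions(counts)
    return {name: n / total for name, n in counts.items()}


total = total_millions(DOMAIN_SENTENCES_M)
shares = domain_shares(DOMAIN_SENTENCES_M)
scale_up = total / BASELINE_MAPS_M

if __name__ == "__main__":
    print(f"total text sentences: {total:.1f}M")
    print(f"scale-up over [11]:   {scale_up:.1f}x")
    for name, share in shares.items():
        print(f"{name:8s} {share:6.1%}")
```

Under these numbers the pool comes to roughly 84M sentences, about 18 times the Maps-only set used in [11], with Maps still contributing the majority of the text.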