Interspeech 2021
DOI: 10.21437/interspeech.2021-165

A Deliberation-Based Joint Acoustic and Text Decoder

Abstract: We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data. Previously, the joint acoustic and text decoder (JATD) has shown promising results through the use of text data during model training and the recently introduced deliberation architecture has reduced recognition errors by leveraging first-pass decoding results. Our method, dubbed Deliberation-JATD, combines the spelling correcting abilities of deliberation wit…

Cited by 5 publications (8 citation statements)
References 29 publications
“…Text-only data can also be converted to TTS utterances for training the whole deliberation decoder. We employ JATD training [11] and scale it up using text data sampled from multiple domains, i.e., 51M, 20M, 1.6M, 0.6M, and 11M text sentences from Maps, News, Play, Search and YouTube domains, respectively. In comparison, [11] uses only 4.6M samples from the Maps domain.…”
Section: Large Scale TTS Training; citation type: mentioning
confidence: 99%
“…While LM relies on only text hypotheses for rescoring, deliberation models have been recently proposed for second-pass rescoring using both text hypotheses and audio [9,10]. Compared to LM training, there has been few attempts at incorporating widely available text-only or audio-only data in deliberation (see [11]). In this work, we research various ways to utilize large-scale text-only and semi-supervised data for deliberation training.…”
Section: Introduction; citation type: mentioning
confidence: 99%