Interspeech 2022
DOI: 10.21437/interspeech.2022-10062
Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM

Abstract: Connectionist temporal classification (CTC)-based models are attractive in automatic speech recognition (ASR) because of their non-autoregressive nature. To take advantage of text-only data, language model (LM) integration approaches such as rescoring and shallow fusion have been widely used for CTC. However, they lose CTC's non-autoregressive nature because of the need for beam search, which slows down inference. In this study, we propose an error correction method with a phone-conditioned masked LM (…)
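The abstract contrasts LM integration via beam search (rescoring, shallow fusion) with the paper's non-autoregressive correction approach. As a point of reference, shallow fusion is commonly formulated as a log-linear interpolation of acoustic-model and LM scores at each search step. The sketch below is a generic illustration of that interpolation, not the paper's implementation; the token scores and the `lm_weight` value are hypothetical.

```python
import math

def shallow_fusion_step(am_log_probs, lm_log_probs, lm_weight=0.5):
    """Return fused scores: log p_AM(token) + lm_weight * log p_LM(token).

    Tokens unknown to the LM receive -inf from the LM side, so they are
    effectively pruned from the fused hypothesis set.
    """
    return {
        tok: am_log_probs[tok] + lm_weight * lm_log_probs.get(tok, float("-inf"))
        for tok in am_log_probs
    }

# Toy example: the acoustic model prefers "a", but the LM's preference
# for "b" flips the fused decision -- the kind of per-step re-ranking
# that requires a (slow, autoregressive) beam search to exploit fully.
am = {"a": math.log(0.6), "b": math.log(0.4)}
lm = {"a": math.log(0.2), "b": math.log(0.8)}
fused = shallow_fusion_step(am, lm, lm_weight=0.5)
best = max(fused, key=fused.get)  # "b"
```

Because this fused score depends on the decoded prefix through the LM, it cannot be computed for all positions in parallel, which is the latency cost the proposed masked-LM correction avoids.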

Cited by 7 publications (2 citation statements)
References 31 publications
“…Similar to (Xiao et al, 2023), we can use multiple candidate lengths from the length prediction module and decode according to each in parallel. Another possibility is to look into fusion mechanisms with language models similar to (Futami et al, 2022). However, one thing to note is that in the on-device space which deliberation models target (Le et al, 2022), the added latency from larger beam size is rarely tolerated and beam size of 1 is often used.…”
Section: Discussion
confidence: 99%
“…Finally, MLM-SC achieves impressive improvements on the Librispeech data with the proposed MS decoding. One recent work [31] uses phone2word conversion masked language model to achieve non-autoregressive spell correction. However, it does not perform well on English tasks.…”
Section: Introduction
confidence: 99%