Interspeech 2022
DOI: 10.21437/interspeech.2022-10062
Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM

Abstract: Connectionist temporal classification (CTC)-based models are attractive in automatic speech recognition (ASR) because of their non-autoregressive nature. To take advantage of text-only data, language model (LM) integration approaches such as rescoring and shallow fusion have been widely used for CTC. However, they lose CTC's non-autoregressive nature because of the need for beam search, which slows down inference. In this study, we propose an error correction method with a phone-conditioned masked LM (…)
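The abstract contrasts LM integration via beam search (rescoring, shallow fusion) with the paper's non-autoregressive correction approach. As a point of reference, shallow fusion is commonly formulated as a log-linear interpolation of acoustic-model and LM scores at each search step. The sketch below is a generic illustration of that interpolation, not the paper's implementation; the token scores and the `lm_weight` value are hypothetical.

```python
import math

def shallow_fusion_step(am_log_probs, lm_log_probs, lm_weight=0.5):
    """Return fused scores: log p_AM(token) + lm_weight * log p_LM(token).

    Tokens unknown to the LM receive -inf from the LM side, so they are
    effectively pruned from the fused hypothesis set.
    """
    return {
        tok: am_log_probs[tok] + lm_weight * lm_log_probs.get(tok, float("-inf"))
        for tok in am_log_probs
    }

# Toy example: the acoustic model prefers "a", but the LM's preference
# for "b" flips the fused decision -- the kind of per-step re-ranking
# that requires a (slow, autoregressive) beam search to exploit fully.
am = {"a": math.log(0.6), "b": math.log(0.4)}
lm = {"a": math.log(0.2), "b": math.log(0.8)}
fused = shallow_fusion_step(am, lm, lm_weight=0.5)
best = max(fused, key=fused.get)  # "b"
```

Because this fused score depends on the decoded prefix through the LM, it cannot be computed for all positions in parallel, which is the latency cost the proposed masked-LM correction avoids.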

Cited by 7 publications (2 citation statements)
References 31 publications
“…Similar to (Xiao et al, 2023), we can use multiple candidate lengths from the length prediction module and decode according to each in parallel. Another possibility is to look into fusion mechanisms with language models similar to (Futami et al, 2022). However, one thing to note is that in the on-device space which deliberation models target (Le et al, 2022), the added latency from larger beam size is rarely tolerated and beam size of 1 is often used.…”
Section: Discussion
confidence: 99%
“…Finally, MLM-SC achieves impressive improvements on the Librispeech data with the proposed MS decoding. One recent work [31] uses phone2word conversion masked language model to achieve non-autoregressive spell correction. However, it does not perform well on English tasks.…”
Section: Introduction
confidence: 99%