Interspeech 2020
DOI: 10.21437/interspeech.2020-2404

Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

Abstract: We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining outputs of the connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually autoregressive: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. On the other hand, non-autoregressive models can simultaneously generate tokens within a constant number of iterations. […]
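As a concrete illustration of the refinement the abstract describes, the sketch below runs greedy CTC, masks the low-confidence tokens, and lets a decoder refill them easy-first over a fixed number of passes. This is a minimal sketch, not the paper's implementation: the `decoder` callable, the token ids, the 0.999 threshold, and the fill schedule are all assumptions.

```python
import torch

BLANK_ID = 0  # CTC blank id (assumption)
MASK_ID = 1   # id reserved for the <MASK> token (assumption)

def greedy_ctc(ctc_probs):
    """Collapse the frame-wise argmax path: merge repeats, drop blanks."""
    path = ctc_probs.argmax(dim=-1)          # (T,) best token per frame
    tokens, confs = [], []
    prev = BLANK_ID
    for t, tok in enumerate(path.tolist()):
        if tok != BLANK_ID and tok != prev:
            tokens.append(tok)
            confs.append(ctc_probs[t, tok].item())
        prev = tok
    return torch.tensor(tokens, dtype=torch.long), torch.tensor(confs)

def mask_ctc_decode(ctc_probs, decoder, threshold=0.999, iterations=5):
    """Mask low-confidence CTC tokens, then refill them over a few passes."""
    y, conf = greedy_ctc(ctc_probs)
    y[conf < threshold] = MASK_ID            # mask unreliable positions
    for _ in range(iterations):
        masked = y == MASK_ID
        if not masked.any():
            break
        probs = decoder(y)                   # (L, V) per-position posteriors
        fill_conf, fill_tok = probs.max(dim=-1)
        # Easy-first: commit only the most confident masked predictions.
        k = max(1, int(masked.sum().item()) // iterations)
        pick = fill_conf.masked_fill(~masked, -1.0).topk(k).indices
        y[pick] = fill_tok[pick]
    masked = y == MASK_ID
    if masked.any():                         # fill any leftovers in one shot
        y[masked] = decoder(y).argmax(dim=-1)[masked]
    return y
```

The easy-first schedule mirrors mask-predict decoding: tokens committed in one pass become context for the remaining masked positions in the next.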

Cited by 107 publications (94 citation statements)
References 21 publications
“…Due to the conditional independence assumption, the CTC-based model generally suffers from poor recognition performance [6]. Mask-CTC has been proposed to mitigate this problem by iteratively refining the CTC output using the bidirectional context of tokens [29]. Mask-CTC adopts an encoder-decoder model built upon Transformer blocks [16].…”
Section: Mask-CTC (mentioning; confidence: 99%)
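The encoder-decoder layout this quote describes can be sketched as a shared speech encoder feeding both a frame-wise CTC head and a non-causal Transformer decoder that re-predicts masked token positions. Dimensions, layer counts, and head names below are illustrative assumptions, not the paper's configuration.

```python
import torch.nn as nn

class MaskCTCSketch(nn.Module):
    """Sketch of an encoder-decoder Mask-CTC-style model (illustrative)."""
    def __init__(self, vocab, d_model=256, nhead=4):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead), num_layers=12)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead), num_layers=6)
        self.embed = nn.Embedding(vocab, d_model)
        self.ctc_head = nn.Linear(d_model, vocab)  # frame-wise CTC posteriors
        self.mlm_head = nn.Linear(d_model, vocab)  # masked-token posteriors

    def forward(self, feats, masked_tokens):
        # feats: (T, N, d_model) encoded speech; masked_tokens: (L, N) ids.
        memory = self.encoder(feats)
        ctc_out = self.ctc_head(memory)
        # No causal mask: the decoder sees bidirectional token context.
        dec = self.decoder(self.embed(masked_tokens), memory)
        return ctc_out, self.mlm_head(dec)
```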
“…NAR models in neural machine translation have achieved performance competitive with AR models [18][19][20][21][22][23][24][25]. Several approaches have been proposed to realize a NAR model in E2E-ASR [26][27][28][29][30][31]. Connectionist temporal classification (CTC) predicts a frame-wise latent alignment between input speech frames and output tokens, and generates a target sequence under a conditional independence assumption between the frame predictions [26].…”
Section: Introduction (mentioning; confidence: 99%)
“…Prevailing iterative NATs regard the decoder as a masked language model, in which tokens with low confidence are first masked and new tokens are then predicted from the unmasked ones. The two steps alternate within a constant number of iterations [13][14][15]. Length prediction of the decoder input is difficult for NAT models.…”
Section: Introduction (mentioning; confidence: 99%)
“…Length prediction of the decoder input is difficult for NAT models. To address the issue, Higuchi et al. used the length of the CTC recognition result [14], while Chan et al. proposed a model named Imputer, which directly uses the length of the input feature sequence [16]. Compared with the left-to-right generation order of an autoregressive Transformer (AT), an iterative NAT adopts an essentially different token generation order, yet iterations are still required.…”
Section: Introduction (mentioning; confidence: 99%)
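The contrast this quote draws can be made concrete: Mask-CTC sidesteps an explicit length predictor by reusing the CTC hypothesis length, whereas Imputer keeps the decoder input at input-frame resolution. A small sketch under those assumptions (function names and the mask id are illustrative, not from either paper's code):

```python
import torch

MASK_ID = 1  # illustrative id for the <MASK> token

def init_from_ctc(ctc_tokens, confidences, threshold=0.999):
    """Mask-CTC style: target length equals the CTC hypothesis length;
    only low-confidence positions start out masked."""
    y = ctc_tokens.clone()
    y[confidences < threshold] = MASK_ID
    return y  # no explicit length predictor needed

def init_from_frames(num_frames):
    """Imputer style: operate at input-frame resolution, so the length
    is simply the number of feature frames (alignment incl. blanks)."""
    return torch.full((num_frames,), MASK_ID, dtype=torch.long)
```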