Interspeech 2020
DOI: 10.21437/interspeech.2020-2404

Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

Abstract: We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining outputs of the connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually autoregressive: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. On the other hand, non-autoregressive models can simultaneously generate tokens within a constant number of iterations. […]
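As a concrete illustration of the refinement the abstract describes, the sketch below runs greedy CTC, masks the low-confidence tokens, and lets a decoder refill them easy-first over a fixed number of passes. This is a minimal sketch, not the paper's implementation: the `decoder` callable, the token ids, the 0.999 threshold, and the fill schedule are all assumptions.

```python
import torch

BLANK_ID = 0  # CTC blank id (assumption)
MASK_ID = 1   # id reserved for the <MASK> token (assumption)

def greedy_ctc(ctc_probs):
    """Collapse the frame-wise argmax path: merge repeats, drop blanks."""
    path = ctc_probs.argmax(dim=-1)          # (T,) best token per frame
    tokens, confs = [], []
    prev = BLANK_ID
    for t, tok in enumerate(path.tolist()):
        if tok != BLANK_ID and tok != prev:
            tokens.append(tok)
            confs.append(ctc_probs[t, tok].item())
        prev = tok
    return torch.tensor(tokens, dtype=torch.long), torch.tensor(confs)

def mask_ctc_decode(ctc_probs, decoder, threshold=0.999, iterations=5):
    """Mask low-confidence CTC tokens, then refill them over a few passes."""
    y, conf = greedy_ctc(ctc_probs)
    y[conf < threshold] = MASK_ID            # mask unreliable positions
    for _ in range(iterations):
        masked = y == MASK_ID
        if not masked.any():
            break
        probs = decoder(y)                   # (L, V) per-position posteriors
        fill_conf, fill_tok = probs.max(dim=-1)
        # Easy-first: commit only the most confident masked predictions.
        k = max(1, int(masked.sum().item()) // iterations)
        pick = fill_conf.masked_fill(~masked, -1.0).topk(k).indices
        y[pick] = fill_tok[pick]
    masked = y == MASK_ID
    if masked.any():                         # fill any leftovers in one shot
        y[masked] = decoder(y).argmax(dim=-1)[masked]
    return y
```

The easy-first schedule mirrors mask-predict decoding: tokens committed in one pass become context for the remaining masked positions in the next.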

Cited by 107 publications (94 citation statements)
References 21 publications
“…Due to the conditional independence assumption, the CTC-based model generally suffers from poor recognition performance [6]. Mask-CTC has been proposed to mitigate this problem by iteratively refining the CTC output using the bidirectional context of tokens [29]. Mask-CTC adopts an encoder-decoder model built upon Transformer blocks [16].…”
Section: Mask-CTC (mentioning; confidence: 99%)
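The encoder-decoder layout this quote describes can be sketched as a shared speech encoder feeding both a frame-wise CTC head and a non-causal Transformer decoder that re-predicts masked token positions. Dimensions, layer counts, and head names below are illustrative assumptions, not the paper's configuration.

```python
import torch.nn as nn

class MaskCTCSketch(nn.Module):
    """Sketch of an encoder-decoder Mask-CTC-style model (illustrative)."""
    def __init__(self, vocab, d_model=256, nhead=4):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead), num_layers=12)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead), num_layers=6)
        self.embed = nn.Embedding(vocab, d_model)
        self.ctc_head = nn.Linear(d_model, vocab)  # frame-wise CTC posteriors
        self.mlm_head = nn.Linear(d_model, vocab)  # masked-token posteriors

    def forward(self, feats, masked_tokens):
        # feats: (T, N, d_model) encoded speech; masked_tokens: (L, N) ids.
        memory = self.encoder(feats)
        ctc_out = self.ctc_head(memory)
        # No causal mask: the decoder sees bidirectional token context.
        dec = self.decoder(self.embed(masked_tokens), memory)
        return ctc_out, self.mlm_head(dec)
```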
“…NAR models in neural machine translation have achieved performance competitive with AR models [18][19][20][21][22][23][24][25]. Several approaches have been proposed to realize a NAR model in E2E-ASR [26][27][28][29][30][31]. Connectionist temporal classification (CTC) predicts a frame-wise latent alignment between input speech frames and output tokens, and generates a target sequence under a conditional independence assumption between the frame predictions [26].…”
Section: Introduction (mentioning; confidence: 99%)
“…Prevailing iterative NATs regard the decoder as a masked language model, in which tokens with low confidence are first masked and new tokens are then predicted from the unmasked ones. The two steps alternate within a constant number of iterations [13][14][15]. Length prediction of the decoder input is difficult for NAT models.…”
Section: Introduction (mentioning; confidence: 99%)
“…Length prediction of the decoder input is difficult for NAT models. To address the issue, Higuchi et al. used the length of the CTC recognition result [14], while Chan et al. proposed a model named Imputer, which directly uses the length of the input feature sequence [16]. Compared with the left-to-right generation order of an autoregressive Transformer (AT), an iterative NAT adopts an essentially different token generation order, yet iterations are still required.…”
Section: Introduction (mentioning; confidence: 99%)
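The contrast this quote draws can be made concrete: Mask-CTC sidesteps an explicit length predictor by reusing the CTC hypothesis length, whereas Imputer keeps the decoder input at input-frame resolution. A small sketch under those assumptions (function names and the mask id are illustrative, not from either paper's code):

```python
import torch

MASK_ID = 1  # illustrative id for the <MASK> token

def init_from_ctc(ctc_tokens, confidences, threshold=0.999):
    """Mask-CTC style: target length equals the CTC hypothesis length;
    only low-confidence positions start out masked."""
    y = ctc_tokens.clone()
    y[confidences < threshold] = MASK_ID
    return y  # no explicit length predictor needed

def init_from_frames(num_frames):
    """Imputer style: operate at input-frame resolution, so the length
    is simply the number of feature frames (alignment incl. blanks)."""
    return torch.full((num_frames,), MASK_ID, dtype=torch.long)
```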