DOI: 10.1109/taslp.2021.3133217

Alignment Knowledge Distillation for Online Streaming Attention-Based Speech Recognition

Abstract: This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust for long-form speech utterances. Moreover, the sequence-level…

Cited by 7 publications (5 citation statements)
References 101 publications
“…We also compared our method to knowledge distillation (KD) with MLM. To apply KD, we utilized a forced-aligned CTC path to align the frame-level predictions of CTC with the token-level predictions of MLM, inspired by [40]. The KD loss function is formulated as…”
Section: Results
confidence: 99%
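The statement above describes matching frame-level CTC posteriors to token-level MLM posteriors through a forced-aligned CTC path. Below is a minimal PyTorch sketch of such an alignment-based KD loss; the function name `aligned_kd_loss`, the one-frame-per-token alignment, and the temperature are illustrative assumptions, not the cited paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def aligned_kd_loss(ctc_logits, mlm_logits, alignment, temperature=1.0):
    """Hypothetical alignment-based KD loss.

    ctc_logits: (T, V) frame-level logits from the CTC student.
    mlm_logits: (U, V) token-level logits from the MLM teacher.
    alignment:  (U,) frame index assigned to each token by the
                forced-aligned CTC path (one frame per token is a
                simplifying assumption).
    """
    # Gather the CTC frame matched to each output token.
    student = ctc_logits[alignment]                           # (U, V)
    # Soften both distributions with a shared temperature.
    log_p_student = F.log_softmax(student / temperature, dim=-1)
    p_teacher = F.softmax(mlm_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the U tokens.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Toy usage with random tensors.
T, U, V = 50, 7, 100
loss = aligned_kd_loss(torch.randn(T, V), torch.randn(U, V),
                       torch.randint(0, T, (U,)))
```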
“…To solve the problem, we propose to use the forced alignment result, that is, the most plausible CTC path $\pi = (\pi_1, \ldots, \pi_T)$ among $\pi \in \mathcal{B}^{-1}(y)$, as in [26], where the CTC path $\pi$ is used to enable a monotonic chunkwise attention (MoChA) model to learn optimal alignments for streaming ASR. $\pi$ can be obtained by tracking the path that has the maximum product of the forward and backward variables, which are obtained in the process of calculating the CTC loss function $\mathcal{L}_{\mathrm{CTC}}$ from Eq.…”
Section: Proposed Method: Distilling the Knowledge of BERT for CTC-ba...
confidence: 99%
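The quoted passage notes that the most plausible CTC path can be read off from the forward and backward variables already computed for the CTC loss. A minimal NumPy sketch of that extraction follows, assuming `log_alpha` and `log_beta` over the blank-expanded label sequence are available from a CTC implementation; the per-frame argmax is an approximation of the path tracking described above, and all names are illustrative.

```python
import numpy as np

def best_ctc_path(log_alpha, log_beta, labels, blank=0):
    """Extract a most-plausible CTC path from forward/backward variables.

    log_alpha, log_beta: (T, S) log forward/backward variables over the
    blank-expanded label sequence y' = (blank, y1, blank, ..., yU, blank).
    """
    T, S = log_alpha.shape
    # Build the blank-expanded label sequence y'.
    expanded = [blank]
    for y in labels:
        expanded += [y, blank]
    assert len(expanded) == S
    # Occupation score gamma_t(s) = alpha_t(s) * beta_t(s), in log space.
    gamma = log_alpha + log_beta
    # Per-frame argmax approximation of "the path with the maximum
    # products of forward and backward variables".
    states = gamma.argmax(axis=1)
    return [expanded[int(s)] for s in states]

# Toy usage: 2 labels -> S = 5 expanded states, 6 frames.
path = best_ctc_path(np.random.randn(6, 5), np.random.randn(6, 5), [3, 7])
```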
“…Several studies have leveraged linguistic and acoustic information from pretrained ASR models to enhance ASR performance [35]–[37]. The KD approach using pretrained ASR models can be categorized into decoder-side and encoder-side methods [36].…”
Section: B. KD for E2E-Based ASR Model
confidence: 99%
“…The decoder-side KD method, which uses the output vector of the ASR decoder, transfers global context to the student model, because the decoder output vector integrates the latent vectors from the ASR encoder over a large context window to capture linguistic-unit information. Thus, the decoder-side KD method maximizes transition probabilities between linguistic units and cannot handle the frame-wise characteristics inherent in the ASR encoder [36], [37]. Conversely, the encoder-side KD method uses the output vector of the ASR encoder, which is optimized in a frame-wise fashion with a fixed-size sliding window.…”
Section: B. KD for E2E-Based ASR Model
confidence: 99%
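To make the encoder-side versus decoder-side distinction above concrete, here is a hedged sketch of where each KD loss attaches; the MSE feature match and all tensor shapes are assumptions for illustration, not the cited works' exact objectives.

```python
import torch
import torch.nn.functional as F

def feature_kd(student_feat, teacher_feat):
    # Feature-level KD as a simple L2 match; other distances are possible.
    return F.mse_loss(student_feat, teacher_feat)

T, U, D = 50, 7, 256  # frames, tokens, feature dim (illustrative)

# Encoder-side KD: match frame-wise encoder outputs (T, D),
# which carry local, frame-level acoustic characteristics.
enc_kd = feature_kd(torch.randn(T, D), torch.randn(T, D))

# Decoder-side KD: match token-wise decoder outputs (U, D),
# which integrate encoder states over a large context window
# and so carry global, linguistic-unit information.
dec_kd = feature_kd(torch.randn(U, D), torch.randn(U, D))
```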