ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053578
Distilling Attention Weights for CTC-Based ASR Systems

Cited by 11 publications (7 citation statements) | References 17 publications
“…As a result, we need a normalisation procedure, which makes the final loss function similar to MMI. There has been other work on modifying CTC [13,14,15], but we are not aware of work looking at the aspect of topology.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
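The quote refers to an MMI-style normalisation of the sequence objective. As a point of reference only, here is a minimal LaTeX sketch of the standard MMI criterion; this is the textbook form, not the cited work's exact formulation, and the choice of denominator hypothesis set is an assumption left open here.

\mathcal{L}_{\mathrm{MMI}} = -\log \frac{p_{\theta}(\mathbf{X} \mid Y_{\mathrm{ref}})\, P(Y_{\mathrm{ref}})}{\sum_{Y} p_{\theta}(\mathbf{X} \mid Y)\, P(Y)}

The denominator sums over competing hypotheses Y, which is the normalisation the quote alludes to; how the cited work restricts or approximates that sum is not specified in the excerpt.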
“…Gao et al. [16] jointly use multiple teacher models to train a student ASR model to improve ASR accuracy. Moriya et al. [17] improve accuracy by adding another term to the loss function, called self-distillation (SD), which comes from incorporating the teacher model.…”
Section: Classic Compression Methods in Language Modeling (citation type: mentioning)
confidence: 99%
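To illustrate what "adding a distillation term to the loss" looks like in practice, below is a minimal PyTorch-style sketch of a CTC loss combined with a frame-level KL term against teacher posteriors. The weight sd_weight, the temperature, and the frame-level KL are illustrative assumptions, not the exact SD formulation of Moriya et al. [17].

import torch
import torch.nn.functional as F

def ctc_with_distillation(student_logits, teacher_logits, targets,
                          input_lengths, target_lengths,
                          blank=0, temperature=2.0, sd_weight=0.5):
    """Sketch: CTC loss plus a frame-level KL distillation term.

    student_logits, teacher_logits: (T, N, C) unnormalised scores.
    The interpolation and temperature scaling follow common KD practice
    and are assumptions, not the cited papers' exact recipe.
    """
    # Standard CTC loss on the student's frame-level log-probabilities.
    log_probs = F.log_softmax(student_logits, dim=-1)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=blank, zero_infinity=True)

    # Soft targets from the (frozen) teacher; KL averaged over frames.
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    return ctc + sd_weight * (temperature ** 2) * kl

In training, teacher_logits would come from a pre-trained model run on the same utterances and kept fixed while the student is updated.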
“…Distillation between different decoder topologies has also been investigated. Moriya et al. distilled knowledge from a teacher AED model to a student CTC model [79]. Self-distillation in a single E2E ASR model has also been proposed as an in-place operation, from an offline mode to a streaming mode [61], [68], and from a Transformer decoder to a CTC layer [80].…”
Section: Knowledge Distillation for Streaming ASR (citation type: mentioning)
confidence: 99%
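For the in-place ("self") variants mentioned above, a hedged way to write the combined objective, assuming a KL divergence between the teacher branch and the CTC branch of the same network, is:

\mathcal{L} = (1-\lambda)\,\mathcal{L}_{\mathrm{CTC}} + \lambda\, \mathrm{KL}\!\left( p_{\mathrm{teacher}} \,\Vert\, p_{\mathrm{CTC}} \right)

Here p_teacher would come from the offline branch or the Transformer decoder of the same model and is typically treated as a constant (no gradient) in the KL term; the weight λ and how the two distributions are aligned (token-level or frame-level) are assumptions for illustration, not details taken from the cited works.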