DOI: 10.1109/taslp.2021.3133217

Alignment Knowledge Distillation for Online Streaming Attention-Based Speech Recognition

Abstract: This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust for long-form speech utterances. Moreover, the sequence-level…

Cited by 7 publications (5 citation statements)
References 101 publications
“…We also compared our method to knowledge distillation (KD) with MLM. To apply KD, we utilized a forced-aligned CTC path to align the frame-level predictions of CTC with the token-level predictions of MLM, inspired by [40]. The KD loss function is formulated as…”
Section: Results
confidence: 99%
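The statement above describes matching frame-level CTC posteriors to token-level MLM posteriors through a forced-aligned CTC path. Below is a minimal PyTorch sketch of such an alignment-based KD loss; the function name `aligned_kd_loss`, the one-frame-per-token alignment, and the temperature are illustrative assumptions, not the cited paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def aligned_kd_loss(ctc_logits, mlm_logits, alignment, temperature=1.0):
    """Hypothetical alignment-based KD loss.

    ctc_logits: (T, V) frame-level logits from the CTC student.
    mlm_logits: (U, V) token-level logits from the MLM teacher.
    alignment:  (U,) frame index assigned to each token by the
                forced-aligned CTC path (one frame per token is a
                simplifying assumption).
    """
    # Gather the CTC frame matched to each output token.
    student = ctc_logits[alignment]                           # (U, V)
    # Soften both distributions with a shared temperature.
    log_p_student = F.log_softmax(student / temperature, dim=-1)
    p_teacher = F.softmax(mlm_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the U tokens.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Toy usage with random tensors.
T, U, V = 50, 7, 100
loss = aligned_kd_loss(torch.randn(T, V), torch.randn(U, V),
                       torch.randint(0, T, (U,)))
```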
“…To solve the problem, we propose to use the forced alignment result, that is, the most plausible CTC path $\pi = (\pi_1, \ldots, \pi_T)$ among $\pi \in \mathcal{B}^{-1}(y)$, as in [26], where the CTC path $\pi$ is used to enable a monotonic chunkwise attention (MoChA) model to learn optimal alignments for streaming ASR. $\pi$ can be obtained by tracking the path that has the maximum product of the forward and backward variables, which are obtained in the process of calculating the CTC loss function $\mathcal{L}_{\mathrm{CTC}}$ from Eq.…”
Section: Proposed Method: Distilling the Knowledge of BERT for CTC-ba...
confidence: 99%
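The quoted passage notes that the most plausible CTC path can be read off from the forward and backward variables already computed for the CTC loss. A minimal NumPy sketch of that extraction follows, assuming `log_alpha` and `log_beta` over the blank-expanded label sequence are available from a CTC implementation; the per-frame argmax is an approximation of the path tracking described above, and all names are illustrative.

```python
import numpy as np

def best_ctc_path(log_alpha, log_beta, labels, blank=0):
    """Extract a most-plausible CTC path from forward/backward variables.

    log_alpha, log_beta: (T, S) log forward/backward variables over the
    blank-expanded label sequence y' = (blank, y1, blank, ..., yU, blank).
    """
    T, S = log_alpha.shape
    # Build the blank-expanded label sequence y'.
    expanded = [blank]
    for y in labels:
        expanded += [y, blank]
    assert len(expanded) == S
    # Occupation score gamma_t(s) = alpha_t(s) * beta_t(s), in log space.
    gamma = log_alpha + log_beta
    # Per-frame argmax approximation of "the path with the maximum
    # products of forward and backward variables".
    states = gamma.argmax(axis=1)
    return [expanded[int(s)] for s in states]

# Toy usage: 2 labels -> S = 5 expanded states, 6 frames.
path = best_ctc_path(np.random.randn(6, 5), np.random.randn(6, 5), [3, 7])
```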
“…Several studies have leveraged linguistic and acoustic information from pretrained ASR models to enhance ASR performance [35]–[37]. The KD approach using pretrained ASR models can be categorized into decoder-side and encoder-side methods [36].…”
Section: B. KD for E2E-Based ASR Model
confidence: 99%
“…The decoder-side KD method, which uses the output vector of the ASR decoder, transfers global context to the student model, because the decoder output vector integrates the latent vectors from the ASR encoder over a large context window to capture linguistic-unit information. Thus, the decoder-side KD method maximizes transition probabilities between linguistic units and cannot handle the frame-wise characteristics inherent in the ASR encoder [36], [37]. Conversely, the encoder-side KD method uses the output vector of the ASR encoder, which is optimized in a frame-wise fashion with a fixed-size sliding window.…”
Section: B. KD for E2E-Based ASR Model
confidence: 99%
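To make the encoder-side versus decoder-side distinction above concrete, here is a hedged sketch of where each KD loss attaches; the MSE feature match and all tensor shapes are assumptions for illustration, not the cited works' exact objectives.

```python
import torch
import torch.nn.functional as F

def feature_kd(student_feat, teacher_feat):
    # Feature-level KD as a simple L2 match; other distances are possible.
    return F.mse_loss(student_feat, teacher_feat)

T, U, D = 50, 7, 256  # frames, tokens, feature dim (illustrative)

# Encoder-side KD: match frame-wise encoder outputs (T, D),
# which carry local, frame-level acoustic characteristics.
enc_kd = feature_kd(torch.randn(T, D), torch.randn(T, D))

# Decoder-side KD: match token-wise decoder outputs (U, D),
# which integrate encoder states over a large context window
# and so carry global, linguistic-unit information.
dec_kd = feature_kd(torch.randn(U, D), torch.randn(U, D))
```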