Interspeech 2020
DOI: 10.21437/interspeech.2020-1223
Self-Distillation for Improving CTC-Transformer-Based ASR Systems

Cited by 16 publications (13 citation statements) | References 24 publications
“…For the LAS model, a WER reduction of 0.28% absolute (from 5.7% to 5.42%) on dev93 corresponds to a relative WER reduction of 7.4% on eval92, when using relaxed attention with γ = 0.1. For the transformer model, the best average result on dev93 is achieved with γ = 0.35, yielding an average WER of 5.80% on dev93 and 3.65% on eval92, which is an 18% relative improvement on eval92 compared to our own Baseline model (4.45%), and exceeds the current WSJ transformer state of the art by Moriya et al [25] (4.20%) by 13.1% relative, without adding any model complexity.…”
Section: Wall Street Journal (WSJ) | mentioning
confidence: 55%
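For reference, the relative reductions quoted above follow the usual relative word error rate formula; the lines below only re-derive the 18% and 13.1% figures from the absolute eval92 WERs stated in the excerpt (baseline 4.45%, Moriya et al. 4.20%, proposed 3.65%).

\[
\Delta_{\text{rel}} = \frac{\mathrm{WER}_{\text{base}} - \mathrm{WER}_{\text{new}}}{\mathrm{WER}_{\text{base}}}, \qquad
\frac{4.45 - 3.65}{4.45} \approx 0.180, \qquad
\frac{4.20 - 3.65}{4.20} \approx 0.131 .
\]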
“…One effective method to deal with overconfidence in end-to-end ASR, introduced in [20], is label smoothing that blends the one-hot label with a uniform distribution or assigns part of the probability mass to tokens that are neighbors of the labeled token in the target sequence [14]. Interestingly, label smoothing is also effective against overfitting [21] and thus is commonly used for AED end-to-end ASR besides related regularization methods such as spectral augmentation [22], dropout [23], multi-task learning with an additional CTC loss [24,25], and the recently proposed multi-encoder-learning that uses additional encoders only during training [26]. Regularization methods applied to the crucial encoder-decoder attention mechanism in AED models were only recently discovered in [27], where CTC predictions in a multi-task learning setup are used to focus the attention in transformer models [27] to relevant frames in the encoded input sequence.…”
Section: Encoder Decoder | mentioning
confidence: 99%
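As a minimal illustration of the uniform label smoothing described in this excerpt, the following PyTorch-style sketch blends the one-hot target with a uniform distribution over the vocabulary; the function name label_smoothed_nll and the weight epsilon are illustrative and not taken from the cited papers.

import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, targets, epsilon=0.1):
    # Cross-entropy with uniform label smoothing:
    # the target distribution is (1 - epsilon) * one_hot + epsilon / V.
    log_probs = F.log_softmax(logits, dim=-1)                    # (batch, vocab)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    uniform_term = -log_probs.mean(dim=-1)                       # (1/V) * sum_k -log p_k
    return ((1.0 - epsilon) * nll + epsilon * uniform_term).mean()

# Toy usage: 4 target tokens, vocabulary of size 10
loss = label_smoothed_nll(torch.randn(4, 10), torch.randint(0, 10, (4,)))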
“…Combining both improved MEL-based models in the MEL-t-Fusion-Late approach yields the lowest WER of 3.40% on eval92, corresponding to a WER reduction of 0.91% absolute w.r.t. the normal Fusion-Late approach (4.31%), and a remarkable reduction of up to 19% relative compared to the best recently published transformer-based approach by Moriya et al [28], as shown in Table 2.…”
Section: Recognition Results and Discussion | mentioning
confidence: 70%
“…Concerning the fusion of additional information into endto-end transformer models, the few existing approaches stem from audiovisual automatic speech recognition [24,25] and neural machine translation [26], where additional encoders are used to gather visual speech information or contextual information, respectively. Recent successful non-fusion techniques for end-to-end models are multi-task learning, e.g., by using a combination of CTC and attention-based losses [27,28], and augmentation techniques such as spectral augmentation [29]. Those methods improve neural networks by adding more variety to the trained models either by composite losses or by randomly withholding information in the input features.…”
Section: Introduction | mentioning
confidence: 99%
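The multi-task learning with a combination of CTC and attention-based losses mentioned in this excerpt is commonly implemented as an interpolation of a frame-level CTC loss and a token-level decoder cross-entropy. The sketch below assumes PyTorch; the function name joint_ctc_attention_loss and the weight ctc_weight (often written as lambda) are illustrative, not taken from the cited papers.

import torch
import torch.nn as nn

# Hypothetical joint CTC/attention objective: L = lambda * L_CTC + (1 - lambda) * L_att
ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
att_loss_fn = nn.CrossEntropyLoss(ignore_index=-1)   # -1 marks padded decoder targets

def joint_ctc_attention_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                             decoder_logits, decoder_targets, ctc_weight=0.3):
    # CTC branch expects log-probabilities shaped (time, batch, vocab)
    l_ctc = ctc_loss_fn(ctc_log_probs, targets, input_lengths, target_lengths)
    # Attention-decoder branch: logits flattened to (batch * dec_len, vocab)
    l_att = att_loss_fn(decoder_logits, decoder_targets)
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att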
“…Moriya et al distilled knowledge from a teacher AED model to a student CTC model [79]. Self-distillation in a single E2E ASR model has also been proposed as an in-place operation, from an offline mode to a streaming mode [61], [68], and from a Transformer decoder to a CTC layer [80]. Unlike those previous methods, we focus on distilling the positions of token boundaries learned in a CTC model to an AED model, rather than distilling the posterior distributions.…”
Section: Knowledge Distillation for Streaming ASR | mentioning
confidence: 99%
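The self-distillation direction described last (from a Transformer decoder to a CTC layer, as in [80]) is typically built around a KL term between teacher and student posteriors. The sketch below shows only that generic ingredient, assumes PyTorch, and leaves out the frame/token alignment step on which the cited methods actually differ; it is not the exact procedure of Moriya et al.

import torch
import torch.nn.functional as F

def distillation_kl(student_logits, teacher_logits, temperature=1.0):
    # KL(teacher || student) over a shared vocabulary, at positions that are
    # assumed to be already aligned between the two branches.
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits.detach() / t, dim=-1)   # no gradients into the teacher
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)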