ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682775
Knowledge Distillation Using Output Errors for Self-attention End-to-end Models

Cited by 29 publications (17 citation statements)
References 10 publications
“…The multi-head transformer-based sequence-to-sequence model [18,2,19] was trained with the TensorFlow framework. For the ASR model, the encoder consists of 10 self-attention blocks with 768 nodes and a filter size of 3,072.…”
Section: Methods (mentioning)
confidence: 99%
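For readers who want to see the encoder shape quoted above in code, here is a minimal sketch assuming TensorFlow 2 / Keras. Only the layer count (10), model width (768), and feed-forward size (3,072) come from the excerpt; the head count of 8 and all names are assumptions.

```python
import tensorflow as tf

# Hyper-parameters from the excerpt (layers, width, feed-forward size);
# NUM_HEADS is an assumption, not stated in the cited text.
NUM_LAYERS, D_MODEL, FFN_DIM, NUM_HEADS = 10, 768, 3072, 8

class EncoderBlock(tf.keras.layers.Layer):
    """One self-attention encoder block: attention + position-wise FFN."""
    def __init__(self):
        super().__init__()
        self.attn = tf.keras.layers.MultiHeadAttention(
            num_heads=NUM_HEADS, key_dim=D_MODEL // NUM_HEADS)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(FFN_DIM, activation="relu"),
            tf.keras.layers.Dense(D_MODEL),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # Residual connection around self-attention, then around the FFN.
        x = self.norm1(x + self.attn(x, x))
        return self.norm2(x + self.ffn(x))

# Stack of 10 blocks, as described in the citing paper's setup.
encoder = tf.keras.Sequential([EncoderBlock() for _ in range(NUM_LAYERS)])
```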
“…The comparison of our proposed method with state-of-the-art methods on TIMIT and LibriSpeech is shown in Table 7. Takashima et al. [37] investigated sequence-level knowledge distillation, and Kim et al. [41] proposed adding an exponential weight coefficient to the sequence-level knowledge distillation method to balance the recognition quality of the teacher model. We chose DeepSpeech2 as the teacher model, selected two layers of LSTM as the student model, and evaluated them with Sequence-level KD, Essence KD, and ERR-KD.…”
Section: Comparison With State-of-the-art Methods (mentioning)
confidence: 99%
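As a rough illustration of the exponentially weighted sequence-level distillation these excerpts refer to, the sketch below down-weights teacher hypotheses by exp(-error) before forming the distillation loss. The exact weighting form, the scale parameter, and all names are assumptions for illustration, not taken from the cited papers.

```python
import tensorflow as tf

def weighted_sequence_kd_loss(student_logprobs, teacher_errors, scale=1.0):
    """Sketch of sequence-level KD with an exponential quality weight.

    student_logprobs: [batch] log-probability the student assigns to each
                      teacher-decoded hypothesis sequence.
    teacher_errors:   [batch] error measure of each teacher hypothesis
                      (e.g. per-utterance WER against the reference).
    """
    # Hypotheses the teacher got badly wrong contribute less to the loss.
    weights = tf.exp(-scale * teacher_errors)
    return -tf.reduce_mean(weights * student_logprobs)
```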
“…Kim et al. [41] added an exponential weight to the sequence-level knowledge distillation method; the weight reflects the quality of the teacher model's output and enters the weighting scheme used to minimize the knowledge distillation loss function. Meng et al. [42] proposed a conditional teacher-student framework, in which the student model selectively chooses to learn from either the ground-truth labels or the outputs of the teacher model.…”
Section: Related Work (mentioning)
confidence: 99%
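A minimal sketch of the conditional teacher-student idea described above, assuming a per-token criterion: where the teacher's prediction matches the ground truth, the student distills from the teacher's distribution; otherwise it falls back to the hard label. Function and variable names are illustrative only.

```python
import tensorflow as tf

def conditional_ts_loss(student_logits, teacher_logits, labels):
    """Per-token conditional teacher-student loss (illustrative sketch)."""
    teacher_pred = tf.argmax(teacher_logits, axis=-1, output_type=labels.dtype)
    teacher_correct = tf.cast(tf.equal(teacher_pred, labels), tf.float32)

    # Soft loss: KL divergence between teacher and student distributions.
    kd = tf.keras.losses.kl_divergence(
        tf.nn.softmax(teacher_logits), tf.nn.softmax(student_logits))
    # Hard loss: cross-entropy against the ground-truth labels.
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)

    # Learn from the teacher only where it is right; otherwise from the labels.
    return tf.reduce_mean(teacher_correct * kd + (1.0 - teacher_correct) * ce)
```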
“…We used 4,096 word-pieces as the output token units. For E2E model training, we used the same hyper-parameters as in [3]. All experiments used the same input feature processing as [24].…”
Section: Methods (mentioning)
confidence: 99%
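The excerpt above fixes the output vocabulary at 4,096 word-pieces but does not name a toolkit. As one possible way to build such a vocabulary, the sketch below uses SentencePiece; the toolkit choice, file paths, and model type are assumptions.

```python
import sentencepiece as spm

# Train a 4,096-unit subword model on the training transcripts
# ("transcripts.txt" and the "wp4096" prefix are placeholder names).
spm.SentencePieceTrainer.train(
    input="transcripts.txt",
    model_prefix="wp4096",
    vocab_size=4096,
    model_type="unigram",   # unigram or bpe both yield word-piece-like units
)

sp = spm.SentencePieceProcessor(model_file="wp4096.model")
print(sp.encode("knowledge distillation for end-to-end ASR", out_type=str))
```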
“…This is because the Transformer has an advantage in terms of computation and parallelism over recurrent neural network (RNN) based models. In addition, knowledge distillation has been studied as a way to create parameter-efficient models [3,4]. Shallow fusion of E2E ASR models with external language models (LMs) has also shown further improvements in WER [5,6], because external LMs are able to learn more contextual information from abundant text-only data.…”
Section: Introduction (mentioning)
confidence: 99%
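Shallow fusion, mentioned in the last excerpt, simply interpolates the E2E model's score with an external LM score when ranking beam-search hypotheses. The sketch below shows that score combination in isolation; the interpolation weight, example scores, and function names are illustrative assumptions.

```python
def shallow_fusion_score(asr_logprob, lm_logprob, lm_weight=0.3):
    """Combined score used to rank beam-search hypotheses.

    asr_logprob: log P_ASR(y | x) from the end-to-end model.
    lm_logprob:  log P_LM(y) from the external language model.
    lm_weight:   interpolation weight (lambda), tuned on a dev set.
    """
    return asr_logprob + lm_weight * lm_logprob

# Example: pick the better of two candidate hypotheses (made-up scores).
hyps = [("hello world", -4.2, -6.1), ("hollow world", -4.0, -9.5)]
best = max(hyps, key=lambda h: shallow_fusion_score(h[1], h[2]))
print(best[0])  # "hello world" wins once the LM score is folded in
```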