2019
DOI: 10.1109/taslp.2019.2929859

General Sequence Teacher–Student Learning

Abstract: In automatic speech recognition, performance gains can often be obtained by combining an ensemble of multiple models. However, this can be computationally expensive when performing recognition. Teacher-student learning alleviates this cost by training a single student model to emulate the combined ensemble behaviour. Only this student needs to be used for recognition. Previously investigated teacher-student criteria often limit the forms of diversity allowed in the ensemble, and only propagate information from…

Cited by 7 publications (4 citation statements)
References 28 publications
“…where X is an input sequence, θ_T is the teacher, θ_S is the student, and H is a hypothesis, which may be expressed as a sequence of words, sub-word units, or states [27]. This study considers state sequences.…”
Section: Boosting Accuracy With Sequence-Level Teacher-Student Learning
mentioning, confidence: 99%
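
The criterion the quoted statement refers to is, in outline, the sequence-level teacher-student criterion of [27]: a KL divergence between the teacher's and the student's posteriors over hypothesis sequences. The LaTeX sketch below is a hedged reconstruction from the names defined in the quote (X, θ_T, θ_S, H); the exact hypothesis space, in practice restricted to a lattice, and the treatment of state sequences follow [27] rather than this outline.

% KL divergence between teacher and student hypothesis-sequence posteriors,
% given the input sequence X. Minimising over theta_S is equivalent to a
% cross-entropy, since the teacher entropy term is constant in theta_S.
\mathcal{F}(\theta_S)
  = \mathrm{KL}\!\left(
      P(\mathcal{H} \mid \mathbf{X}, \theta_T)
      \,\middle\|\,
      P(\mathcal{H} \mid \mathbf{X}, \theta_S)
    \right)
  = -\sum_{\mathcal{H}}
      P(\mathcal{H} \mid \mathbf{X}, \theta_T)\,
      \log \frac{P(\mathcal{H} \mid \mathbf{X}, \theta_S)}
                {P(\mathcal{H} \mid \mathbf{X}, \theta_T)} .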
“…where α_k is the combination weight for the kth teacher, θ_{T_k}, satisfying Σ_k α_k = 1 and α_k ≥ 0. The contribution to the T/S gradient from the teachers can be obtained by performing a separate forward-backward operation over each of the teachers' lattices, which represent each teacher's hypotheses [27]. This is computationally expensive, especially when using a large amount of training data.…”
Section: Boosting Accuracy With Sequence-Level Teacher-Student Learning
mentioning, confidence: 99%
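
The teacher combination the statement describes can be written out as follows. This is a hedged sketch built only from the constraints quoted above (weights α_k over teachers θ_{T_k}, summing to one); the lattice-level details are those of [27].

% Combined teacher posterior: a weighted sum of the K individual
% teachers' hypothesis-sequence posteriors.
P(\mathcal{H} \mid \mathbf{X}, \Theta_T)
  = \sum_{k=1}^{K} \alpha_k \, P(\mathcal{H} \mid \mathbf{X}, \theta_{T_k}),
  \qquad \sum_{k=1}^{K} \alpha_k = 1, \quad \alpha_k \ge 0 .

The gradient of the teacher-student criterion with respect to the student then decomposes into per-teacher terms, each requiring its own forward-backward pass over that teacher's lattice, which is the cost the statement highlights.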
“…Wu et al. [29] proposed a multi-teacher knowledge distillation framework that compresses the model by transferring knowledge from multiple teachers to a single student model. Wong et al. [30] proposed training the student from an ensemble with a diversity of state cluster sets. Simultaneous distillation algorithms [31] train a set of models at the same time, with the models learning from each other in a peer-teaching manner.…”
Section: Related Work
mentioning, confidence: 99%
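
To make the multi-teacher distillation idea shared by [29]-[31] concrete, here is a minimal frame-level sketch in PyTorch. Everything in it is assumed for illustration: hypothetical teacher and student networks producing per-frame logits over a shared state set, and fixed combination weights. It is not the lattice-based sequence-level method of [27] or [30], nor the specific frameworks of [29] or [31].

# Hedged sketch: frame-level multi-teacher knowledge distillation.
# Assumes teachers and student all output (batch, frames, states) logits
# over the same state set; weights are the alpha_k combination weights.
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights):
    # Combine teacher posteriors: sum_k alpha_k * P(state | x, teacher_k).
    teacher_post = sum(w * F.softmax(t, dim=-1)
                       for w, t in zip(weights, teacher_logits_list))
    # Cross-entropy of the student against the combined teacher posterior
    # (equal to the KL divergence up to a term constant in the student).
    log_student = F.log_softmax(student_logits, dim=-1)
    return -(teacher_post * log_student).sum(dim=-1).mean()

# Example usage with random tensors (8 utterances, 100 frames, 2000 states):
# teachers = [torch.randn(8, 100, 2000) for _ in range(3)]
# student  = torch.randn(8, 100, 2000, requires_grad=True)
# loss = multi_teacher_kd_loss(student, teachers, weights=[0.5, 0.3, 0.2])
# loss.backward()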