Interspeech 2020
DOI: 10.21437/interspeech.2020-1179

Distilling the Knowledge of BERT for Sequence-to-Sequence ASR

Abstract: Attention-based sequence-to-sequence (seq2seq) models have achieved promising results in automatic speech recognition (ASR). However, as these models decode in a left-to-right way, they do not have access to context on the right. We leverage both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation. In our proposed method, BERT generates soft labels to guide the training of seq2seq ASR. Furthermore, we leverage context beyond the current utterance …
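The abstract states that BERT produces soft labels that guide the training of the seq2seq ASR decoder. The sketch below is a rough illustration of that idea only, not the paper's exact formulation: the decoder's per-token distribution is pulled toward BERT's soft labels with a KL-divergence term, interpolated with the usual cross-entropy on the reference transcript. The function name distillation_loss, the weight lam, the temperature T, and the assumed tensor shapes are all introduced here for illustration.

```python
# Minimal sketch (assumed shapes and hyperparameters, not the authors' recipe):
# interpolate hard-label cross-entropy with a KL term against BERT soft labels.
import torch
import torch.nn.functional as F

def distillation_loss(decoder_logits, bert_soft_labels, targets, lam=0.5, T=1.0):
    """decoder_logits: (batch, seq_len, vocab) raw scores from the ASR decoder.
    bert_soft_labels: (batch, seq_len, vocab) probabilities obtained from BERT,
    e.g. by masking each position of the reference transcript offline.
    targets: (batch, seq_len) ground-truth token ids."""
    # Standard cross-entropy against the ground-truth transcript.
    ce = F.cross_entropy(decoder_logits.transpose(1, 2), targets)
    # KL divergence between the decoder distribution and BERT's soft labels.
    log_p = F.log_softmax(decoder_logits / T, dim=-1)
    kl = F.kl_div(log_p, bert_soft_labels, reduction="batchmean") * (T * T)
    return (1.0 - lam) * ce + lam * kl
```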

Cited by 41 publications (22 citation statements)
References 37 publications (63 reference statements)

“…BERTalsem remains competitive compared to GPT-2 rescoring, and the best performance is obtained by BERTalsem with ac./GPT-2 scores. Future work will include the introduction of an attention mechanism and context information beyond the utterance level [5].…”
Section: Discussion (mentioning)
confidence: 99%
“…The original usage of BERT mainly focused on NLP tasks, ranging from token-level to sequence-level classification, including question answering [9,10], document summarization [11,12], information retrieval [13,14], and machine translation [15,16], just to name a few. There have also been attempts to combine BERT with ASR, including rescoring [17,18] or generating soft labels for training [19]. In this section, we review the fundamentals of BERT.…”
Section: BERT (mentioning)
confidence: 99%
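The excerpt above mentions N-best rescoring as one way BERT has been combined with ASR [17,18]. Purely as a hedged illustration of that rescoring idea (not the method of the distilled-BERT paper, and not any cited paper's exact recipe), the sketch below scores each hypothesis with BERT's pseudo-log-likelihood via the Hugging Face transformers API; the helper names pseudo_log_likelihood and rescore and the interpolation weight alpha are assumptions made here.

```python
# Hedged sketch of BERT N-best rescoring: mask one token at a time and sum
# BERT's log-probability of the original token (pseudo-log-likelihood).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def pseudo_log_likelihood(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, ids.size(0) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

def rescore(nbest, alpha=0.3):
    # nbest: list of (hypothesis_text, asr_score) pairs; pick the hypothesis
    # with the best interpolated ASR + BERT score (alpha is hypothetical).
    return max(nbest, key=lambda h: (1 - alpha) * h[1]
               + alpha * pseudo_log_likelihood(h[0]))
```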
“…Representative methods are used to distill knowledge from an external language model to improve the capturing of linguistic contexts [24,25]. Our proposed large-context knowledge distillation method is regarded as an extension of the latter methods to enable the capturing of all preceding linguistic contexts beyond utterance boundaries using large-context language models [9-13].…”
Section: Related Work (mentioning)
confidence: 99%