Interspeech 2020
DOI: 10.21437/interspeech.2020-1179

Distilling the Knowledge of BERT for Sequence-to-Sequence ASR

Abstract: Attention-based sequence-to-sequence (seq2seq) models have achieved promising results in automatic speech recognition (ASR). However, as these models decode in a left-to-right way, they do not have access to context on the right. We leverage both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation. In our proposed method, BERT generates soft labels to guide the training of seq2seq ASR. Furthermore, we leverage context beyond the current utterance …
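The abstract states that BERT produces soft labels that guide the training of the seq2seq ASR decoder. The sketch below is a rough illustration of that idea only, not the paper's exact formulation: the decoder's per-token distribution is pulled toward BERT's soft labels with a KL-divergence term, interpolated with the usual cross-entropy on the reference transcript. The function name distillation_loss, the weight lam, the temperature T, and the assumed tensor shapes are all introduced here for illustration.

```python
# Minimal sketch (assumed shapes and hyperparameters, not the authors' recipe):
# interpolate hard-label cross-entropy with a KL term against BERT soft labels.
import torch
import torch.nn.functional as F

def distillation_loss(decoder_logits, bert_soft_labels, targets, lam=0.5, T=1.0):
    """decoder_logits: (batch, seq_len, vocab) raw scores from the ASR decoder.
    bert_soft_labels: (batch, seq_len, vocab) probabilities obtained from BERT,
    e.g. by masking each position of the reference transcript offline.
    targets: (batch, seq_len) ground-truth token ids."""
    # Standard cross-entropy against the ground-truth transcript.
    ce = F.cross_entropy(decoder_logits.transpose(1, 2), targets)
    # KL divergence between the decoder distribution and BERT's soft labels.
    log_p = F.log_softmax(decoder_logits / T, dim=-1)
    kl = F.kl_div(log_p, bert_soft_labels, reduction="batchmean") * (T * T)
    return (1.0 - lam) * ce + lam * kl
```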

Cited by 41 publications (22 citation statements)
References 37 publications (63 reference statements)

“…BERTalsem remains competitive compared to GPT-2 rescoring, and the best performance is obtained by BERTalsem with ac./GPT-2 scores. Future work will include the introduction of an attention mechanism and context information beyond the utterance level [5].…”
Section: Discussion (mentioning)
confidence: 99%
“…The original usage of BERT mainly focused on NLP tasks, ranging from token-level to sequence-level classification, including question answering [9,10], document summarization [11,12], information retrieval [13,14], and machine translation [15,16], just to name a few. There have also been attempts to combine BERT with ASR, including rescoring [17,18] or generating soft labels for training [19]. In this section, we review the fundamentals of BERT.…”
Section: BERT (mentioning)
confidence: 99%
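The excerpt above mentions N-best rescoring as one way BERT has been combined with ASR [17,18]. Purely as a hedged illustration of that rescoring idea (not the method of the distilled-BERT paper, and not any cited paper's exact recipe), the sketch below scores each hypothesis with BERT's pseudo-log-likelihood via the Hugging Face transformers API; the helper names pseudo_log_likelihood and rescore and the interpolation weight alpha are assumptions made here.

```python
# Hedged sketch of BERT N-best rescoring: mask one token at a time and sum
# BERT's log-probability of the original token (pseudo-log-likelihood).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def pseudo_log_likelihood(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, ids.size(0) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

def rescore(nbest, alpha=0.3):
    # nbest: list of (hypothesis_text, asr_score) pairs; pick the hypothesis
    # with the best interpolated ASR + BERT score (alpha is hypothetical).
    return max(nbest, key=lambda h: (1 - alpha) * h[1]
               + alpha * pseudo_log_likelihood(h[0]))
```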
“…Representative methods are used to distill knowledge from an external language model to improve the capturing of linguistic contexts [24,25]. Our proposed large-context knowledge distillation method is regarded as an extension of the latter methods to enable the capturing of all preceding linguistic contexts beyond utterance boundaries using large-context language models [9-13].…”
Section: Related Work (mentioning)
confidence: 99%