ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413668

Speech Recognition by Simply Fine-Tuning BERT

Abstract: We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, which is a language model (LM) trained on large-scale unlabeled text data and can generate rich contextual representations. Our assumption is that given a history context sequence, a powerful LM can narrow the range of possible choices and the speech signal can be used as a simple clue. Hence, compared to conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe that speech recognition is p…
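The abstract frames ASR as next-token prediction by a pre-trained LM, with the speech signal serving only as a supporting clue. Below is a minimal, hypothetical sketch of that kind of formulation using Hugging Face's `transformers`: acoustic features are projected into BERT's embedding space and appended to the embedded text history, and the model is fine-tuned to predict the next token. The class name, the single-vector acoustic "clue", and the fusion scheme are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertWithSpeechClue(nn.Module):
    """Hypothetical sketch: predict the next token from the text history,
    with a projected acoustic feature acting as an extra 'clue' position.
    Illustrates the general idea only, not the paper's exact model."""

    def __init__(self, bert_name="bert-base-uncased", acoustic_dim=80):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden)   # speech -> BERT space
        self.classifier = nn.Linear(hidden, self.bert.config.vocab_size)

    def forward(self, history_ids, acoustic_feats):
        # embed the history tokens with BERT's own embedding table
        text_emb = self.bert.embeddings.word_embeddings(history_ids)
        # project pooled acoustic features and append them as one extra position
        clue = self.acoustic_proj(acoustic_feats).unsqueeze(1)
        inputs = torch.cat([text_emb, clue], dim=1)
        out = self.bert(inputs_embeds=inputs).last_hidden_state
        # predict the next token from the clue position
        return self.classifier(out[:, -1])

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertWithSpeechClue()
ids = tok("the cat sat on the", return_tensors="pt")["input_ids"]
fbank = torch.randn(1, 80)          # stand-in acoustic features (assumed 80-dim)
logits = model(ids, fbank)          # scores for the next token
```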

Cited by 33 publications (41 citation statements) | References 12 publications (13 reference statements)
“…Because of the success, previous studies have investigated the pre-trained language model to enhance the performance of ASR. On the one hand, several studies directly leverage a pre-trained language model as a portion of the ASR model [13,14,15,16,17,18,19]. Although such designs are straightforward, they can obtain satisfactory performances.…”
Section: Related Work
confidence: 99%
“…However, even with the pre-trained model obtained by wav2vec2.0, the CTC model needs an external language model (LM) to relax its conditional independence assumption [9,10]. Several works have investigated incorporating BERT into a NAR ASR model to achieve better recognition accuracies [11][12][13]. In order to bridge the length gap between the frame-level speech input and token-level text output, [11] and [12] have introduced global attention and a serial continuous integrate-and-fire (CIF) [14], respectively.…”
Section: Introduction
confidence: 99%
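The statement above points to continuous integrate-and-fire (CIF) as one way to bridge the length gap between frame-level speech features and token-level text. As a rough illustration of that mechanism only (not the cited papers' implementations), the sketch below accumulates per-frame weights until they reach a threshold and then "fires" one token-level vector; the function name and toy inputs are assumptions for demonstration.

```python
import numpy as np

def cif_downsample(frame_feats, alphas, threshold=1.0):
    """Simplified continuous integrate-and-fire (CIF): accumulate per-frame
    weights until they reach `threshold`, then emit ('fire') one token-level
    vector as the weighted sum of the frames integrated so far."""
    tokens = []
    acc_weight = 0.0
    acc_vec = np.zeros(frame_feats.shape[1])
    for h_t, a_t in zip(frame_feats, alphas):
        if acc_weight + a_t < threshold:
            acc_weight += a_t
            acc_vec += a_t * h_t
        else:
            # split the weight: part closes the current token,
            # the remainder starts accumulating the next one
            needed = threshold - acc_weight
            tokens.append(acc_vec + needed * h_t)
            acc_weight = a_t - needed
            acc_vec = acc_weight * h_t
    return np.stack(tokens) if tokens else np.zeros((0, frame_feats.shape[1]))

# toy usage: 20 frames of 4-dim features; in a real model the weights
# would be predicted from the encoder output, here they are fixed
frames = np.random.randn(20, 4)
alphas = np.full(20, 0.25)
token_feats = cif_downsample(frames, alphas)
print(token_feats.shape)            # (5, 4): 20 frames collapsed to 5 token slots
```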
“…Such pre-trained models have been shown to improve diverse NLP tasks, alleviating the heavy requirement of supervised training data. Inspired by the great success in NLP, pre-trained LMs have been actively adopted for speech processing tasks, including automatic speech recognition (ASR) (Shin et al, 2019;Huang et al, 2021), spoken language understanding (SLU) (Chuang et al, 2020;Chung et al, 2021), and text-to-speech synthesis (Hayashi et al, 2019;Kenter et al, 2020).…”
Section: Introduction
confidence: 99%
“…Several attempts have been made to use pretrained LMs indirectly for improving E2E-ASR, such as N-best hypothesis rescoring (Shin et al, 2019;Salazar et al, 2020;Chiu and Chen, 2021;Futami et al, 2021;Udagawa et al, 2022) and knowledge distillation (Futami et al, 2020;Bai et al, 2021;Kubo et al, 2022). Others have investigated directly unifying an E2E-ASR model with a pre-trained LM, where the LM is fine-tuned to optimize ASR in an end-to-end trainable framework (Huang et al, 2021;Zheng et al, 2021;Deng et al, 2021;Yu et al, 2022).…”
Section: Introduction
confidence: 99%
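Several of the rescoring works cited above (e.g., Salazar et al., 2020) score each n-best hypothesis with a masked LM via pseudo-log-likelihood: mask one token at a time and sum the log-probabilities of the original tokens. A minimal sketch of that general recipe with Hugging Face `transformers` follows; the interpolation with the acoustic/CTC score is omitted, and the toy hypotheses are invented for illustration.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def pseudo_log_likelihood(sentence):
    """Score a hypothesis with a masked LM by masking one position at a
    time and summing the log-probability of the original token there."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):              # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()
    return total

# rescore hypothetical ASR n-best hypotheses (in practice this LM score is
# interpolated with the acoustic score rather than used alone)
hypotheses = ["speech recognition by fine tuning bert",
              "speech wreck ignition by fine tuning bird"]
best = max(hypotheses, key=pseudo_log_likelihood)
print(best)
```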