ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413668

Speech Recognition by Simply Fine-Tuning BERT

Abstract: We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, which is a language model (LM) trained on large-scale unlabeled text data and can generate rich contextual representations. Our assumption is that given a history context sequence, a powerful LM can narrow the range of possible choices and the speech signal can be used as a simple clue. Hence, compared to conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe that speech recognition is p…
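The abstract frames ASR as next-token prediction by a pre-trained LM, with the speech signal serving only as a supporting clue. Below is a minimal, hypothetical sketch of that kind of formulation using Hugging Face's `transformers`: acoustic features are projected into BERT's embedding space and appended to the embedded text history, and the model is fine-tuned to predict the next token. The class name, the single-vector acoustic "clue", and the fusion scheme are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertWithSpeechClue(nn.Module):
    """Hypothetical sketch: predict the next token from the text history,
    with a projected acoustic feature acting as an extra 'clue' position.
    Illustrates the general idea only, not the paper's exact model."""

    def __init__(self, bert_name="bert-base-uncased", acoustic_dim=80):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden)   # speech -> BERT space
        self.classifier = nn.Linear(hidden, self.bert.config.vocab_size)

    def forward(self, history_ids, acoustic_feats):
        # embed the history tokens with BERT's own embedding table
        text_emb = self.bert.embeddings.word_embeddings(history_ids)
        # project pooled acoustic features and append them as one extra position
        clue = self.acoustic_proj(acoustic_feats).unsqueeze(1)
        inputs = torch.cat([text_emb, clue], dim=1)
        out = self.bert(inputs_embeds=inputs).last_hidden_state
        # predict the next token from the clue position
        return self.classifier(out[:, -1])

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertWithSpeechClue()
ids = tok("the cat sat on the", return_tensors="pt")["input_ids"]
fbank = torch.randn(1, 80)          # stand-in acoustic features (assumed 80-dim)
logits = model(ids, fbank)          # scores for the next token
```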

Cited by 33 publications (41 citation statements) | References 12 publications (13 reference statements)
“…Because of the success, previous studies have investigated the pre-trained language model to enhance the performance of ASR. On the one hand, several studies directly leverage a pre-trained language model as a portion of the ASR model [13,14,15,16,17,18,19]. Although such designs are straightforward, they can obtain satisfactory performances.…”
Section: Related Work
confidence: 99%
“…However, even with the pre-trained model obtained by wav2vec2.0, the CTC model needs an external language model (LM) to relax its conditional independence assumption [9,10]. Several works have investigated incorporating BERT into a NAR ASR model to achieve better recognition accuracies [11][12][13]. In order to bridge the length gap between the frame-level speech input and token-level text output, [11] and [12] have introduced global attention and a serial continuous integrate-and-fire (CIF) [14], respectively.…”
Section: Introduction
confidence: 99%
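The statement above points to continuous integrate-and-fire (CIF) as one way to bridge the length gap between frame-level speech features and token-level text. As a rough illustration of that mechanism only (not the cited papers' implementations), the sketch below accumulates per-frame weights until they reach a threshold and then "fires" one token-level vector; the function name and toy inputs are assumptions for demonstration.

```python
import numpy as np

def cif_downsample(frame_feats, alphas, threshold=1.0):
    """Simplified continuous integrate-and-fire (CIF): accumulate per-frame
    weights until they reach `threshold`, then emit ('fire') one token-level
    vector as the weighted sum of the frames integrated so far."""
    tokens = []
    acc_weight = 0.0
    acc_vec = np.zeros(frame_feats.shape[1])
    for h_t, a_t in zip(frame_feats, alphas):
        if acc_weight + a_t < threshold:
            acc_weight += a_t
            acc_vec += a_t * h_t
        else:
            # split the weight: part closes the current token,
            # the remainder starts accumulating the next one
            needed = threshold - acc_weight
            tokens.append(acc_vec + needed * h_t)
            acc_weight = a_t - needed
            acc_vec = acc_weight * h_t
    return np.stack(tokens) if tokens else np.zeros((0, frame_feats.shape[1]))

# toy usage: 20 frames of 4-dim features; in a real model the weights
# would be predicted from the encoder output, here they are fixed
frames = np.random.randn(20, 4)
alphas = np.full(20, 0.25)
token_feats = cif_downsample(frames, alphas)
print(token_feats.shape)            # (5, 4): 20 frames collapsed to 5 token slots
```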
“…Such pre-trained models have been shown to improve diverse NLP tasks, alleviating the heavy requirement of supervised training data. Inspired by the great success in NLP, pre-trained LMs have been actively adopted for speech processing tasks, including automatic speech recognition (ASR) (Shin et al, 2019;Huang et al, 2021), spoken language understanding (SLU) (Chuang et al, 2020;Chung et al, 2021), and text-to-speech synthesis (Hayashi et al, 2019;Kenter et al, 2020).…”
Section: Introduction
confidence: 99%
“…Several attempts have been made to use pretrained LMs indirectly for improving E2E-ASR, such as N-best hypothesis rescoring (Shin et al, 2019;Salazar et al, 2020;Chiu and Chen, 2021;Futami et al, 2021;Udagawa et al, 2022) and knowledge distillation (Futami et al, 2020;Bai et al, 2021;Kubo et al, 2022). Others have investigated directly unifying an E2E-ASR model with a pre-trained LM, where the LM is fine-tuned to optimize ASR in an end-to-end trainable framework (Huang et al, 2021;Zheng et al, 2021;Deng et al, 2021;Yu et al, 2022).…”
Section: Introduction
confidence: 99%
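Several of the rescoring works cited above (e.g., Salazar et al., 2020) score each n-best hypothesis with a masked LM via pseudo-log-likelihood: mask one token at a time and sum the log-probabilities of the original tokens. A minimal sketch of that general recipe with Hugging Face `transformers` follows; the interpolation with the acoustic/CTC score is omitted, and the toy hypotheses are invented for illustration.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def pseudo_log_likelihood(sentence):
    """Score a hypothesis with a masked LM by masking one position at a
    time and summing the log-probability of the original token there."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):              # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()
    return total

# rescore hypothetical ASR n-best hypotheses (in practice this LM score is
# interpolated with the acoustic score rather than used alone)
hypotheses = ["speech recognition by fine tuning bert",
              "speech wreck ignition by fine tuning bird"]
best = max(hypotheses, key=pseudo_log_likelihood)
print(best)
```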