ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747887
Improving CTC-Based Speech Recognition Via Knowledge Transferring from Pre-Trained Language Models

Cited by 14 publications (8 citation statements); references 22 publications.
“…However, these models often slow down the decoding speed and usually have a large set of model parameters. On the other hand, a school of research makes the ASR model learn linguistic information from pre-trained language models in a teacher-student training manner [20,21,7,22,23]. These models still retain a fast decoding speed, but their improvements are usually incremental.…”
Section: Related Work
Confidence: 99%
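The teacher-student transfer described in that statement can be pictured with a short sketch. Everything below (function name, tensor layout, and the use of an MSE distillation term) is an assumption for illustration rather than any cited paper's implementation; the only point taken from the statement is that the CTC objective is combined with a term that pulls the ASR student's representations toward a frozen pre-trained language model, so decoding stays as fast as plain CTC.

    import torch.nn.functional as F

    def total_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                   student_token_repr, teacher_token_repr, distill_weight=1.0):
        # ctc_log_probs: (T, batch, vocab) log-probabilities from the CTC branch.
        ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths)
        # Distillation term: match the student's token-level representations to the
        # frozen teacher LM's (the MSE form is assumed; the cited works differ in
        # the exact loss they use).
        distill = F.mse_loss(student_token_repr, teacher_token_repr.detach())
        # Weighted sum keeps CTC as the main objective; inference uses only CTC.
        return ctc + distill_weight * distill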
“…where the objective function L_KT is defined to minimize the cosine embedding loss, and a scaling hyper-parameter k is used to equalize the numerical imbalance between the cosine embedding loss and other losses [7]. The indices 0 and N + 1 denote the positions of the special tokens, which are ignored in calculating the training loss.…”
Section: Token-dependent Knowledge Transferring Module
Confidence: 99%
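The quoted loss can be sketched as follows. Tensor names and shapes are assumptions; the ingredients taken from the statement are the cosine embedding loss between the model's token-level representations and the pre-trained LM's embeddings, the scaling hyper-parameter k, and the exclusion of the special tokens at positions 0 and N + 1.

    import torch.nn.functional as F

    def knowledge_transfer_loss(h, e, k=1.0):
        # h, e: (batch, N + 2, dim) token-level representations, where positions 0
        # and N + 1 hold the special tokens (assumed layout).
        h_inner = h[:, 1:-1, :]                              # drop positions 0 and N + 1
        e_inner = e[:, 1:-1, :]
        cos = F.cosine_similarity(h_inner, e_inner, dim=-1)  # (batch, N)
        return k * (1.0 - cos).mean()                        # scaled cosine embedding loss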