2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru51503.2021.9688009
Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models

Cited by 19 publications (12 citation statements)
References 26 publications
“…Recently, some research has been conducted to integrate BERT into the ASR model [49]. In [50], K. Deng et al. initialize the encoder using wav2vec2.0 [51] and the decoder with the pre-trained LM DistilGPT2, to take full advantage of the pre-trained acoustic and language models.…”
Section: Related Work
confidence: 99%
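The initialization this statement describes can be sketched concretely. Below is a minimal, hypothetical illustration (not the authors' released code), assuming HuggingFace `transformers` and PyTorch: the class name, the `bridge` projection, and the forward signature are assumptions, while `facebook/wav2vec2-base` and `distilgpt2` are the public checkpoints corresponding to the models named in the quote.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, GPT2LMHeadModel

class HybridCTCAttentionASR(nn.Module):
    """Hybrid CTC/attention ASR with pretrained encoder and decoder (sketch)."""

    def __init__(self, vocab_size: int):
        super().__init__()
        # Acoustic encoder initialized from pretrained wav2vec2.0.
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        enc_dim = self.encoder.config.hidden_size
        # CTC branch: a linear projection over the encoder states.
        self.ctc_head = nn.Linear(enc_dim, vocab_size)
        # Decoder initialized from pretrained DistilGPT2; the cross-attention
        # layers are newly added (randomly initialized) so the LM can attend
        # to the acoustic representations.
        self.decoder = GPT2LMHeadModel.from_pretrained(
            "distilgpt2", add_cross_attention=True
        )
        dec_dim = self.decoder.config.n_embd
        # Hypothetical bridge projecting encoder states to the decoder width
        # (identity when the two hidden sizes already match).
        self.bridge = (nn.Linear(enc_dim, dec_dim)
                       if enc_dim != dec_dim else nn.Identity())

    def forward(self, input_values: torch.Tensor, text_ids: torch.Tensor):
        enc = self.encoder(input_values).last_hidden_state   # (B, T, enc_dim)
        ctc_logits = self.ctc_head(enc)                      # CTC branch
        dec = self.decoder(input_ids=text_ids,
                           encoder_hidden_states=self.bridge(enc))
        return ctc_logits, dec.logits                        # attention branch
```

Loading DistilGPT2 with `add_cross_attention=True` keeps the pretrained LM weights intact and only adds fresh cross-attention parameters, which are then learned from paired speech–text data.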
“…Because of this success, previous studies have investigated pre-trained language models to enhance the performance of ASR. On the one hand, several studies directly leverage a pre-trained language model as a portion of the ASR model [13,14,15,16,17,18,19]. Although such designs are straightforward, they can obtain satisfactory performance.…”
Section: Related Work
confidence: 99%
“…The most straightforward method is to employ them as an acoustic feature encoder and then stack a simple neural-network layer on top of the encoder to do speech recognition [9]. After that, some studies present various cascade methods that concatenate pre-trained language and speech representation learning models for ASR [14,15,17,18]. Although these methods have proven their capabilities and effectiveness on benchmark corpora, their complicated model architectures and/or large-scale model parameters have usually made them hard to use in practice.…”
Section: Related Work
confidence: 99%
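The "pretrained encoder plus one simple layer" recipe mentioned in this statement can be written in a few lines. This is a minimal sketch of the standard wav2vec2-style CTC fine-tuning setup, not any cited paper's exact code; the class name is an assumption.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class EncoderPlusLinearCTC(nn.Module):
    """Pretrained speech encoder with a single linear layer for CTC (sketch)."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.lm_head = nn.Linear(self.encoder.config.hidden_size, vocab_size)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_values).last_hidden_state  # (B, T, H)
        return self.lm_head(hidden).log_softmax(dim=-1)        # CTC log-probs
```

Such a model is trained with `torch.nn.CTCLoss` on the log-probabilities (transposed to time-major) together with the input and target lengths.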
“…For text-only data, text is mainly used to train an external language model (LM) for joint decoding [11,12,13,14,15]. In order to make use of both unpaired speech and text, many methods have recently been proposed, e.g., integration of a pre-trained acoustic model and LM [16,17,18,19], cycle-consistency based dual-training [20,21,22,23], and shared representation learning [24,25,26,27], which rely on hybrid models with multitask training and some of which become less effective in cases with a very limited amount of labeled data. The current mainstream methods that achieve state-of-the-art (SOTA) results in low-resource ASR use unpaired speech and text for pre-training and training an LM for joint decoding, respectively [7,8], and adopt an additional iterative self-training [28].…”
Section: Introduction
confidence: 99%
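The joint decoding with an external LM named in this statement reduces to a weighted score combination at each beam-search step, as in the common hybrid CTC/attention recipe. The sketch below is illustrative only; the weight values are assumptions, not numbers from any cited paper.

```python
def joint_score(att_logp: float, ctc_logp: float, lm_logp: float,
                ctc_weight: float = 0.3, lm_weight: float = 0.5) -> float:
    """Rank one beam-search hypothesis extension by the weighted sum of
    attention-decoder, CTC-prefix, and external-LM log-probabilities."""
    return ((1.0 - ctc_weight) * att_logp
            + ctc_weight * ctc_logp
            + lm_weight * lm_logp)
```

During decoding, each candidate token extension of a hypothesis is scored this way and the beam keeps the top-scoring prefixes; the external LM term is where text-only training data enters the system.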