2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru51503.2021.9688009
Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models

Cited by 19 publications (12 citation statements)
References 26 publications
“…Recently, some research has been conducted to integrate BERT into the ASR model [49]. In [50], K. Deng et al. initialize the encoder using wav2vec2.0 [51] and the decoder with the pre-trained LM DistilGPT2, to take full advantage of the pre-trained acoustic and language models.…”
Section: Related Work
confidence: 99%
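The initialization this statement describes can be sketched concretely. Below is a minimal, hypothetical illustration (not the authors' released code), assuming HuggingFace `transformers` and PyTorch: the class name, the `bridge` projection, and the forward signature are assumptions, while `facebook/wav2vec2-base` and `distilgpt2` are the public checkpoints corresponding to the models named in the quote.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, GPT2LMHeadModel

class HybridCTCAttentionASR(nn.Module):
    """Hybrid CTC/attention ASR with pretrained encoder and decoder (sketch)."""

    def __init__(self, vocab_size: int):
        super().__init__()
        # Acoustic encoder initialized from pretrained wav2vec2.0.
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        enc_dim = self.encoder.config.hidden_size
        # CTC branch: a linear projection over the encoder states.
        self.ctc_head = nn.Linear(enc_dim, vocab_size)
        # Decoder initialized from pretrained DistilGPT2; the cross-attention
        # layers are newly added (randomly initialized) so the LM can attend
        # to the acoustic representations.
        self.decoder = GPT2LMHeadModel.from_pretrained(
            "distilgpt2", add_cross_attention=True
        )
        dec_dim = self.decoder.config.n_embd
        # Hypothetical bridge projecting encoder states to the decoder width
        # (identity when the two hidden sizes already match).
        self.bridge = (nn.Linear(enc_dim, dec_dim)
                       if enc_dim != dec_dim else nn.Identity())

    def forward(self, input_values: torch.Tensor, text_ids: torch.Tensor):
        enc = self.encoder(input_values).last_hidden_state   # (B, T, enc_dim)
        ctc_logits = self.ctc_head(enc)                      # CTC branch
        dec = self.decoder(input_ids=text_ids,
                           encoder_hidden_states=self.bridge(enc))
        return ctc_logits, dec.logits                        # attention branch
```

Loading DistilGPT2 with `add_cross_attention=True` keeps the pretrained LM weights intact and only adds fresh cross-attention parameters, which are then learned from paired speech–text data.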
“…Because of this success, previous studies have investigated pre-trained language models to enhance the performance of ASR. On the one hand, several studies directly leverage a pre-trained language model as a portion of the ASR model [13,14,15,16,17,18,19]. Although such designs are straightforward, they can obtain satisfactory performance.…”
Section: Related Work
confidence: 99%
“…The most straightforward method is to employ them as an acoustic feature encoder and then stack a simple neural-network layer on top of the encoder to do speech recognition [9]. After that, some studies present various cascade methods that concatenate pre-trained language and speech representation learning models for ASR [14,15,17,18]. Although these methods have proven their capabilities and effectiveness on benchmark corpora, their complicated model architectures and/or large-scale model parameters have usually made them hard to use in practice.…”
Section: Related Work
confidence: 99%
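The "pretrained encoder plus one simple layer" recipe mentioned in this statement can be written in a few lines. This is a minimal sketch of the standard wav2vec2-style CTC fine-tuning setup, not any cited paper's exact code; the class name is an assumption.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class EncoderPlusLinearCTC(nn.Module):
    """Pretrained speech encoder with a single linear layer for CTC (sketch)."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.lm_head = nn.Linear(self.encoder.config.hidden_size, vocab_size)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_values).last_hidden_state  # (B, T, H)
        return self.lm_head(hidden).log_softmax(dim=-1)        # CTC log-probs
```

Such a model is trained with `torch.nn.CTCLoss` on the log-probabilities (transposed to time-major) together with the input and target lengths.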
“…For text-only data, text is mainly used to train an external language model (LM) for joint decoding [11,12,13,14,15]. In order to make use of both unpaired speech and text, many methods have recently been proposed, e.g., integration of a pre-trained acoustic model and LM [16,17,18,19], cycle-consistency based dual-training [20,21,22,23], and shared representation learning [24,25,26,27], which rely on hybrid models with multitask training and some of which become less effective in cases with a very limited amount of labeled data. The current mainstream methods that achieve state-of-the-art (SOTA) results in low-resource ASR use unpaired speech and text for pre-training and training an LM for joint decoding, respectively [7,8], and adopt an additional iterative self-training [28].…”
Section: Introduction
confidence: 99%
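The joint decoding with an external LM named in this statement reduces to a weighted score combination at each beam-search step, as in the common hybrid CTC/attention recipe. The sketch below is illustrative only; the weight values are assumptions, not numbers from any cited paper.

```python
def joint_score(att_logp: float, ctc_logp: float, lm_logp: float,
                ctc_weight: float = 0.3, lm_weight: float = 0.5) -> float:
    """Rank one beam-search hypothesis extension by the weighted sum of
    attention-decoder, CTC-prefix, and external-LM log-probabilities."""
    return ((1.0 - ctc_weight) * att_logp
            + ctc_weight * ctc_logp
            + lm_weight * lm_logp)
```

During decoding, each candidate token extension of a hypothesis is scored this way and the beam keeps the top-scoring prefixes; the external LM term is where text-only training data enters the system.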