ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746316

Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models

Cited by 13 publications (5 citation statements). References 22 publications.
“…In addition, transformers offer the advantage of parallelising computations, enabling faster training of deeper models on larger datasets. Recently, language models have shown their power in capturing high-level, long-term patterns across different data types, including text [21,96], images [157,158], and speech [159-161]. This has also opened avenues for developing large language models in the speech and audio domain.…”
Section: Automatic Speech Recognition (ASR) [mentioning, confidence: 99%]
“…Because of this success, previous studies have investigated pre-trained language models to enhance the performance of ASR. On the one hand, several studies directly leverage a pre-trained language model as a portion of the ASR model [13,14,15,16,17,18,19]. Although such designs are straightforward, they can obtain satisfactory performance.…”
Section: Related Work [mentioning, confidence: 99%]
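The passage above describes folding a pre-trained language model into an ASR system. As an illustration only (not necessarily the method of the cited papers), a common lightweight variant is to rescore an ASR first pass's N-best hypotheses with a pre-trained causal LM; in this minimal sketch the "gpt2" checkpoint, the example hypotheses, and the interpolation weight are all assumptions.

```python
# Illustrative sketch: rescoring ASR N-best hypotheses with a pre-trained LM.
# Assumes the HuggingFace `transformers` library; "gpt2" is an arbitrary choice.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_log_prob(text: str) -> float:
    """Approximate total log-probability of `text` under the pre-trained LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, `loss` is the mean per-token NLL.
        loss = model(ids, labels=ids).loss
    return -loss.item() * ids.size(1)

# Hypothetical N-best list from an ASR first pass, with acoustic scores.
nbest = [("i saw the cat", -12.3), ("eye saw the cat", -12.1)]
lm_weight = 0.5  # interpolation weight, an assumption
best = max(nbest, key=lambda h: h[1] + lm_weight * lm_log_prob(h[0]))
```

The LM score is simply interpolated with the acoustic score; heavier integrations (using the LM as part of the decoder itself) follow the same idea but share parameters end to end.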
“…The most straightforward method is to employ them as an acoustic feature encoder and then stack a simple neural-network layer on top of the encoder to do speech recognition [9]. After that, some studies presented various cascade methods that concatenate pre-trained language and speech representation learning models for ASR [14,15,17,18]. Although these methods have proven their capabilities and effectiveness on benchmark corpora, their complicated model architectures and/or large-scale model parameters usually make them hard to use in practice.…”
Section: Related Work [mentioning, confidence: 99%]
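The "encoder plus one simple layer" recipe quoted above is easy to make concrete. Below is a minimal sketch using HuggingFace's Wav2Vec2Model as the pre-trained acoustic encoder with a single linear CTC head; the checkpoint name and vocabulary size are assumptions, and training (CTC loss, fine-tuning schedule) is omitted.

```python
# Minimal sketch of "pre-trained encoder + one simple layer" ASR.
# Assumes HuggingFace `transformers`; checkpoint and vocab size are arbitrary.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class CTCHead(nn.Module):
    def __init__(self, vocab_size: int = 32):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.proj = nn.Linear(self.encoder.config.hidden_size, vocab_size)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # (batch, samples) -> (batch, frames, hidden) -> (batch, frames, vocab)
        hidden = self.encoder(waveform).last_hidden_state
        return self.proj(hidden).log_softmax(dim=-1)

model = CTCHead()
log_probs = model(torch.randn(1, 16000))  # one second of 16 kHz audio
```

The cascade methods the quote mentions replace the single linear layer with a pre-trained language model stacked on the acoustic encoder, which is exactly where the parameter-count and complexity concerns arise.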
“…Non-autoregressive speech processing was first used in [18]. After that, many more non-autoregressive methods have been proposed [19-25]. Among these methods, two are appropriate for achieving non-autoregressive spelling correction.…”
Section: Introduction [mentioning, confidence: 99%]
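The quote contrasts non-autoregressive decoding with the usual token-by-token generation. As one concrete illustration (CTC greedy decoding is a standard non-autoregressive method, not necessarily the one meant by the cited works), the sketch below emits all output tokens in a single parallel pass: argmax per frame, collapse repeats, drop blanks; the vocabulary size and random posteriors are placeholders.

```python
# Non-autoregressive decoding illustrated with CTC greedy search: every frame
# is decided in one parallel pass, with no loop conditioned on prior outputs.
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list[int]:
    """log_probs: (frames, vocab) frame-level log-probabilities."""
    frame_ids = log_probs.argmax(dim=-1).tolist()  # one shot, all frames
    out, prev = [], blank
    for i in frame_ids:
        if i != prev and i != blank:  # collapse repeats, then drop blanks
            out.append(i)
        prev = i
    return out

# Hypothetical posteriors over a 5-symbol vocabulary (index 0 = blank).
ids = ctc_greedy_decode(torch.randn(50, 5).log_softmax(dim=-1))
```

Because every frame is classified independently, decoding cost is one forward pass regardless of output length, which is the speed advantage that motivates non-autoregressive ASR and spelling correction.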