Haoran Miao scite author profile

Recently, Transformer has gained success in automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose the Transformer-based online CTC/attention E2E ASR architecture, which contains the chunk self-attention encoder (chunk-SAE) and the monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks. To reduce the computational cost and improve the performance, we propose the state reuse chunk-SAE. Sencondly, the MTA based SAD truncates the speech features monotonically and performs attention on the truncated features. To support the online recognition, we integrate the state reuse chunk-SAE and the MTA based SAD into online CTC/attention architecture. We evaluate the proposed online models on the HKUST Mandarin ASR benchmark and achieve a 23.66% character error rate (CER) with a 320 ms latency. Our online model yields as little as 0.19% absolute CER degradation compared with the offline baseline, and achieves significant improvement over our prior work on Long Short-Term Memory (LSTM) based online E2E models.

show abstract

Online Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

Miao

Cheng

Zhang

et al. 2019

View full text Add to dashboard Cite

The hybrid CTC/attention end-to-end automatic speech recognition (ASR) combines CTC ASR system and attention ASR system into a single neural network. Although the hybrid CTC/attention ASR system takes the advantages of both CTC and attention architectures in training and decoding, it remains challenging to be used for streaming speech recognition for its attention mechanism, CTC prefix probability and bidirectional encoder. In this paper, we propose a stable monotonic chunkwise attention (sMoChA) to stream its attention branch and a truncated CTC prefix probability (T-CTC) to stream its CTC branch. On the acoustic model side, we utilize the latencycontrolled bidirectional long short-term memory (LC-BLSTM) to stream its encoder. On the joint CTC/attention decoding side, we propose the dynamic waiting joint decoding (DWDJ) algorithm to collect the decoding hypotheses from the CTC and attention branches. Through the combination of the above methods, we stream the hybrid CTC/attention ASR system without much word error rate degradation.

show abstract

Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture

Miao¹,

Cheng²,

Zhang³

et al. 2020

IEEE/ACM Trans. Audio Speech Lang. Process.

View full text Add to dashboard Cite

Recently, there has been increasing progress in endto-end automatic speech recognition (ASR) architecture, which transcribes speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture, which utilizes the advantages of both CTC and attention. The hybrid CTC/attention ASR systems exhibit performance comparable to that of the conventional deep neural network (DNN) / hidden Markov model (HMM) ASR systems. However, how to deploy hybrid CTC/attention systems for online speech recognition is still a non-trivial problem. This paper describes our proposed online hybrid CTC/attention endto-end ASR architecture, which replaces all the offline components of conventional CTC/attention ASR architecture with their corresponding streaming components. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve the training-and-decoding mismatch problem of sMoChA. Secondly, we propose truncated CTC (T-CTC) prefix score to stream CTC prefix score calculation. Thirdly, we design dynamic waiting joint decoding (DWJD) algorithm to dynamically collect the predictions of CTC and attention in an online manner. Finally, we use latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream the widely-used offline bidirectional encoder network. Experiments with LibriSpeech English and HKUST Mandarin tasks demonstrate that, compared with the offline CTC/attention model, our proposed online CTC/attention model improves the real time factor in human-computer interaction services and maintains its performance with moderate degradation. To the best of our knowledge, this is the first work to provide the full-scale online solution for CTC/attention end-to-end ASR architecture.

show abstract

Keyword Search Using Attention-Based End-to-End ASR and Frame-Synchronous Phoneme Alignments

Yang

Cheng

Miao

et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Haoran Miao

Transformer-Based Online CTC/Attention End-To-End Speech Recognition Architecture

Online Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture

Keyword Search Using Attention-Based End-to-End ASR and Frame-Synchronous Phoneme Alignments

Contact Info

Product

Resources

About