2020
DOI: 10.1109/taslp.2020.2987752

Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture

Abstract: Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architectures, which transcribe speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture, which utilizes the advantages of both CTC and attention. The hybrid CTC/attention ASR systems exhibit performance comparable to that of the conventional deep neural network (DNN) / hidden Markov …

Cited by 45 publications (33 citation statements)
References 38 publications
“…In [47], the authors proposed an online hybrid CTC/attention E2E ASR architecture that replaces all the offline components of a conventional CTC/attention ASR architecture with corresponding streaming components, evaluated on the LibriSpeech English task and the Mandarin task from the Hong Kong University of Science and Technology (HKUST), to decode speech in a low-latency, real-time manner. The researchers in [92] introduced a combined framework that integrates social signal detection (SSD) and ASR systems based on CTC, which is an end-to-end model.…”
Section: Signal Processing (mentioning)
Confidence: 99%
“…MTA [19] aims to solve the mismatch between training and decoding. Specifically, MoChA and sMoChA attend only to the context within a predefined chunk of width w during decoding, but they receive the full input history during training.…”
Section: Streaming Attention (mentioning)
Confidence: 99%
“…7. For full details, please refer to [19]. We use both sMoChA and MTA techniques in the experiments section.…”
Section: Streaming Attention (mentioning)
Confidence: 99%
“…This not only speeds up the model training, but also expands the scope of attention to all the encoding timesteps before the current truncating point. In addition, MTA has so far given the best ASR performance among various hard attention mechanisms according to [28].…”
Section: Transformer for Online ASR (mentioning)
Confidence: 99%