Interspeech 2019
DOI: 10.21437/interspeech.2019-1938

Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Abstract: The state-of-the-art neural network architecture named Transformer has been used successfully for many sequence-to-sequence transformation tasks. The advantage of this architecture is that it has a fast iteration speed in the training stage because there is no sequential operation as with recurrent neural networks (RNN). However, an RNN is still the best option for end-to-end automatic speech recognition (ASR) tasks in terms of overall training speed (i.e., convergence) and word error rate (WER) because of effe…

Cited by 178 publications (160 citation statements)
References 15 publications (43 reference statements)
“…An RNN-based language model (LM) is employed via shallow fusion. The RNN-LM consists of 4 LSTM layers with 2048 units each [13], CTC prefix beam search decoding only [20], and attention beam search decoding only [3]. In addition, results for including the RNN-LM, for using data augmentation [25] as well as for the large transformer setup are shown.…”
Section: Dataset (mentioning, confidence: 99%)
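The shallow fusion referred to in this statement interpolates the end-to-end ASR model's score with an external RNN-LM score at each beam-search expansion. Below is a minimal sketch of that interpolation, assuming hypothetical score dictionaries and an illustrative fusion weight; it is not the cited systems' implementation.

```python
import math

def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine ASR and external LM scores for one beam-search expansion.

    asr_log_probs / lm_log_probs: dict mapping candidate token -> log probability.
    lm_weight: interpolation weight for the language model (illustrative value).
    """
    fused = {}
    for token, asr_lp in asr_log_probs.items():
        lm_lp = lm_log_probs.get(token, -math.inf)
        # Shallow fusion: log P_asr(y|x) + lambda * log P_lm(y)
        fused[token] = asr_lp + lm_weight * lm_lp
    return fused

# Usage: keep the best-scoring extensions for the beam.
asr_scores = {"a": -0.5, "b": -1.2, "<eos>": -2.0}
lm_scores = {"a": -0.9, "b": -0.4, "<eos>": -1.5}
best = sorted(shallow_fusion_step(asr_scores, lm_scores).items(),
              key=lambda kv: kv[1], reverse=True)[:2]
print(best)
```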
“…In addition, results for including the RNN-LM, for using data augmentation [25] as well as for the large transformer setup are shown. Table 1 presents ASR results of our transformer-based baseline systems, which are jointly trained with CTC to optimize training convergence and ASR accuracy [3,13]. Results of different decoding methods are shown with and without using the RNN-LM, SpecAugment [25], and the large transformer model.…”
Section: Dataset (mentioning, confidence: 99%)
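The joint training with CTC mentioned in this statement follows the hybrid CTC/attention recipe [3,13], in which a CTC loss on the encoder output is interpolated with the attention decoder's cross-entropy loss. The PyTorch sketch below illustrates that weighted objective; the tensor shapes, module layout, and the 0.3 weight are assumptions for illustration, not the cited systems' exact configuration.

```python
import torch
import torch.nn as nn

# Hybrid CTC/attention objective: L = lambda * L_ctc + (1 - lambda) * L_att
ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
att_loss_fn = nn.CrossEntropyLoss(ignore_index=-1)

def joint_loss(ctc_log_probs, input_lengths, targets, target_lengths,
               decoder_logits, decoder_targets, ctc_weight=0.3):
    """ctc_log_probs: (T, N, vocab) log-softmax outputs of the encoder's CTC head.
    decoder_logits: (N * L, vocab) attention-decoder outputs, flattened over time.
    ctc_weight: interpolation weight (0.3 is an illustrative choice)."""
    l_ctc = ctc_loss_fn(ctc_log_probs, targets, input_lengths, target_lengths)
    l_att = att_loss_fn(decoder_logits, decoder_targets)
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att

# Toy usage with random tensors just to exercise the shapes.
T, N, V, L = 50, 2, 30, 10
logp = torch.randn(T, N, V).log_softmax(-1)
tgt = torch.randint(1, V, (N, L))
loss = joint_loss(logp, torch.full((N,), T), tgt, torch.full((N,), L),
                  torch.randn(N * L, V), tgt.reshape(-1))
print(loss.item())
```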
“…Recently, Transformer [12] has gained success in ASR field [13,14,15]. Transformer-based models are parallelizable and competitive to recurrent neural networks [16].…”
Section: Introduction (mentioning, confidence: 99%)
“…Recently, Transformer models [15] have shown impressive performance in many tasks, such as pretrained language models [16,17], end-to-end speech recognition [18,19], and speaker diarization [20], surpassing the long short-term memory recurrent neural networks (LSTM-RNNs) based models. One of the key components in the Transformer model is self-attention, which computes the contribution information of the whole input sequence and maps the sequence into a vector at every time step.…”
Section: Introduction (mentioning, confidence: 99%)
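The self-attention operation described in the statement above scores every position of the input sequence against every other position and produces one context vector per time step. A minimal scaled dot-product sketch is given below; the single-head NumPy formulation is an illustrative assumption rather than the cited model's multi-head layer.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (T, d_model) input sequence; w_q / w_k / w_v: (d_model, d_k) projections.
    Returns a (T, d_k) sequence in which each time step is a weighted sum over
    the whole input, weighted by query-key similarity."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (T, T) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ v

T, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(0)
out = self_attention(rng.normal(size=(T, d_model)),
                     *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
print(out.shape)  # (5, 8)
```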