ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053896
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

Abstract: In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurre…
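The joint step the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's code: for each pair of acoustic frame position t and label history u, the audio- and label-encoder activations are combined by a feed-forward layer into a distribution over the label space. All shapes, the additive combination, and the parameter names are illustrative assumptions.

```python
# Sketch of a transducer joint network (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

T, U, D, V = 5, 3, 8, 10          # frames, label positions, encoder dim, vocab (incl. blank)
f = rng.standard_normal((T, D))   # audio-encoder activations, one row per frame
g = rng.standard_normal((U, D))   # label-encoder activations, one row per label state
W = rng.standard_normal((D, V)) * 0.1
b = np.zeros(V)

def joint(f_t, g_u):
    """Feed-forward joint: combine one audio frame and one label state
    into a probability distribution over the label space."""
    h = np.tanh(f_t + g_u)        # simple additive combination (an assumption)
    logits = h @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()            # softmax over labels

# One distribution for every (t, u) combination, as in the abstract.
P = np.stack([[joint(f[t], g[u]) for u in range(U)] for t in range(T)])
print(P.shape)                    # (T, U, V)
```

An RNN-T-style loss would then marginalize over all alignment paths through this (T, U, V) lattice; that part is omitted here.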

Cited by 364 publications
(241 citation statements)
References 19 publications
“…Therefore, several models based on attention mechanisms have been proposed to make it possible for Transducer models to exploit contextual information. The Transformer-Transducer (T-T) [15,16] has been proposed for speech recognition, with the Transformer [17] having become the state-of-the-art approach in language modeling and machine translation [18][19][20]. These models replace the LSTM with the Transformer encoder, which mainly consists of multi-head attention mechanisms, feed-forward networks, and layer normalization.…”
Section: Introduction
confidence: 99%
“…Experiments based on the T-T show that the accuracy of a streaming model that considers contextual information is comparable to that of offline models. Both the truncated self-attention adopted in [15] and the masked self-attention adopted in [16] reduce the error rate of the streaming model.…”
Section: Introduction
confidence: 99%
“…Other research has focused on local monotonic attention [24][25]. Google proposed Transformer encoders with RNN-T loss [26], showing that limiting the left and right attention context per layer can achieve reasonable accuracy, though a gap remains relative to full-attention models.…”
Section: Introduction
confidence: 99%
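The per-layer context limiting mentioned in the excerpt above can be sketched as an additive mask on self-attention scores: each frame may attend only to a fixed number of left and right neighboring frames. The window sizes, single-head form, and shapes below are illustrative assumptions, not values from the cited papers.

```python
# Sketch of context-limited (masked) self-attention (illustrative only).
import numpy as np

def context_mask(T, left, right):
    """Boolean (T, T) mask: True where frame i may attend to frame j,
    i.e. -left <= j - i <= right."""
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]     # rel[i, j] = j - i
    return (rel >= -left) & (rel <= right)

def masked_self_attention(x, left=2, right=1):
    """Single-head self-attention over x (T, D) with a limited context window."""
    d = x.shape[-1]
    scores = (x @ x.T) / np.sqrt(d)
    # Disallowed positions get a large negative score before the softmax.
    scores = np.where(context_mask(len(x), left, right), scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

x = np.random.default_rng(1).standard_normal((6, 4))
y = masked_self_attention(x)
print(y.shape)                            # (6, 4)
```

Setting `right=0` removes all future context, which is what makes such a layer usable in a streaming recognizer; stacking layers grows the effective left/right receptive field by the per-layer window at each depth.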
“…Most conventional ASR systems [2,3] consist of separately trained modules, such as the acoustic model, language model, and pronunciation dictionary. In recent years, end-to-end ASR systems [4][5][6][7], which can be directly trained to maximize the probability of a word sequence given an acoustic feature sequence, have become the focus of research. Many researchers [7,8] have reported that end-to-end ASR systems can significantly simplify the speech recognition pipeline and outperform conventional ASR systems on several representative speech datasets.…”
Section: Introduction
confidence: 99%