2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003750
A Comparative Study on Transformer vs RNN in Speech Applications

Abstract: Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural …
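Since the abstract contrasts the two model families at the architecture level, the following is a minimal, illustrative PyTorch sketch of the two encoder styles being compared. All dimensions and layer counts here are assumptions chosen for illustration, not the paper's actual per-task configurations.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only; the paper's actual
# configurations differ per task (ASR, ST, TTS).
MODEL_DIM, HEADS, LAYERS = 256, 4, 12

# Transformer-style encoder: self-attention over the whole utterance.
transformer_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=MODEL_DIM, nhead=HEADS,
                               dim_feedforward=1024, batch_first=True),
    num_layers=LAYERS,
)

# RNN-style encoder: a bidirectional LSTM processing frames sequentially.
rnn_enc = nn.LSTM(input_size=MODEL_DIM, hidden_size=MODEL_DIM // 2,
                  num_layers=3, bidirectional=True, batch_first=True)

x = torch.randn(2, 100, MODEL_DIM)   # (batch, frames, features)
h_transformer = transformer_enc(x)   # (2, 100, MODEL_DIM)
h_rnn, _ = rnn_enc(x)                # (2, 100, MODEL_DIM)
```

The practical difference the paper studies follows from this contrast: the Transformer attends to all frames in parallel, while the LSTM must step through them in order.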

Cited by 572 publications (424 citation statements) · References 39 publications
“…In this paper, we introduce a new E2E-TTS toolkit named ESPnet-TTS, which is an extension of the open-source speech processing toolkit ESPnet [16], [17] (https://github.com/espnet/espnet). The toolkit is developed for research purposes, to make E2E-TTS systems more user-friendly and to accelerate research in this field.…”
Section: Introduction (mentioning)
confidence: 99%
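As a hedged illustration of the user-friendliness that citation describes, here is a short sketch of pretrained-model inference with ESPnet. The class name, `from_pretrained` entry point, and model tag are assumptions taken from ESPnet's public documentation, not from the citing paper, and may differ across toolkit versions.

```python
# Sketch only: assumes the espnet2 inference API and the model tag below,
# both from ESPnet's public docs, not from the cited ESPnet-TTS paper.
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_tacotron2")
out = tts("Hello from an end-to-end TTS model.")
wav = out["wav"]  # synthesized waveform as a torch tensor
```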
“…All of our experiments use the same mixed convolutional-recurrent end-to-end model architecture for conditional sequence generation, our focus being data augmentation techniques. (Recent work suggests that AST performance with Transformer is similar to AST performance with this style of model [17].) We use a speech encoder consisting of two non-linear layers followed by two convolutional layers and three bidirectional LSTM layers, along with a custom LSTM decoder [13, 7].…”
Section: Model Architecture (mentioning)
confidence: 99%
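A minimal PyTorch sketch of the encoder shape that citation describes (two non-linear layers, two convolutional layers, three bidirectional LSTM layers). All sizes, strides, and activation choices are hypothetical; the citing paper's exact hyperparameters are not given in this excerpt.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Sketch of the described encoder: two non-linear (feed-forward)
    layers, two conv layers, then three bidirectional LSTM layers.
    All sizes are illustrative, not taken from the citing paper."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Two 2-D convolutions with stride-2 time downsampling.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=(2, 1), padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(32 * hidden, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)

    def forward(self, x):               # x: (batch, frames, feat_dim)
        h = self.ff(x).unsqueeze(1)     # (batch, 1, frames, hidden)
        h = self.conv(h)                # (batch, 32, frames // 4, hidden)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        out, _ = self.blstm(h)          # (batch, frames // 4, 2 * hidden)
        return out

enc = SpeechEncoder()
feats = enc(torch.randn(2, 100, 80))    # -> (2, 25, 512)
```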
“…Recently, Transformer models [15] have shown impressive performance in many tasks, such as pretrained language models [16, 17], end-to-end speech recognition [18, 19], and speaker diarization [20], surpassing models based on long short-term memory recurrent neural networks (LSTM-RNNs). One of the key components of the Transformer model is self-attention, which computes the contribution of every position in the input sequence and maps the sequence to a vector at each time step.…”
Section: Introduction (mentioning)
confidence: 99%
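The self-attention operation that citation describes can be written in a few lines. This is a minimal single-head, scaled dot-product sketch: every output position is a weighted sum over the whole input sequence, yielding one context vector per time step. Dimensions and weight initialization are illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # (seq, d_k) each
    scores = q @ k.T / k.shape[-1] ** 0.5    # (seq, seq) pairwise similarities
    weights = F.softmax(scores, dim=-1)      # contribution of every position
    return weights @ v                       # (seq, d_k): one vector per step

seq_len, d_model, d_k = 10, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
ctx = self_attention(x, w_q, w_k, w_v)       # (10, 8)
```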