2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003750
A Comparative Study on Transformer vs RNN in Speech Applications

Abstract: Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural …
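Since the abstract contrasts the two model families at the architecture level, the following is a minimal, illustrative PyTorch sketch of the two encoder styles being compared. All dimensions and layer counts here are assumptions chosen for illustration, not the paper's actual per-task configurations.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only; the paper's actual
# configurations differ per task (ASR, ST, TTS).
MODEL_DIM, HEADS, LAYERS = 256, 4, 12

# Transformer-style encoder: self-attention over the whole utterance.
transformer_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=MODEL_DIM, nhead=HEADS,
                               dim_feedforward=1024, batch_first=True),
    num_layers=LAYERS,
)

# RNN-style encoder: a bidirectional LSTM processing frames sequentially.
rnn_enc = nn.LSTM(input_size=MODEL_DIM, hidden_size=MODEL_DIM // 2,
                  num_layers=3, bidirectional=True, batch_first=True)

x = torch.randn(2, 100, MODEL_DIM)   # (batch, frames, features)
h_transformer = transformer_enc(x)   # (2, 100, MODEL_DIM)
h_rnn, _ = rnn_enc(x)                # (2, 100, MODEL_DIM)
```

The practical difference the paper studies follows from this contrast: the Transformer attends to all frames in parallel, while the LSTM must step through them in order.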

Cited by 572 publications (424 citation statements) · References 39 publications
“…In this paper, we introduce a new E2E-TTS toolkit named ESPnet-TTS, which is an extension of the open-source speech processing toolkit ESPnet [16], [17] (https://github.com/espnet/espnet). The toolkit is developed for research purposes, to make E2E-TTS systems more user-friendly and to accelerate research in this field.…”
Section: Introduction (mentioning)
confidence: 99%
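As a hedged illustration of the user-friendliness that citation describes, here is a short sketch of pretrained-model inference with ESPnet. The class name, `from_pretrained` entry point, and model tag are assumptions taken from ESPnet's public documentation, not from the citing paper, and may differ across toolkit versions.

```python
# Sketch only: assumes the espnet2 inference API and the model tag below,
# both from ESPnet's public docs, not from the cited ESPnet-TTS paper.
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_tacotron2")
out = tts("Hello from an end-to-end TTS model.")
wav = out["wav"]  # synthesized waveform as a torch tensor
```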
“…All of our experiments use the same mixed convolutional-recurrent end-to-end model architecture for conditional sequence generation, our focus being data augmentation techniques. (Recent work suggests that AST performance with Transformer is similar to AST performance with this style of model [17].) We use a speech encoder consisting of two non-linear layers followed by two convolutional layers and three bidirectional LSTM layers, along with a custom LSTM decoder [13, 7].…”
Section: Model Architecture (mentioning)
confidence: 99%
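A minimal PyTorch sketch of the encoder shape that citation describes (two non-linear layers, two convolutional layers, three bidirectional LSTM layers). All sizes, strides, and activation choices are hypothetical; the citing paper's exact hyperparameters are not given in this excerpt.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Sketch of the described encoder: two non-linear (feed-forward)
    layers, two conv layers, then three bidirectional LSTM layers.
    All sizes are illustrative, not taken from the citing paper."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Two 2-D convolutions with stride-2 time downsampling.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=(2, 1), padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(32 * hidden, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)

    def forward(self, x):               # x: (batch, frames, feat_dim)
        h = self.ff(x).unsqueeze(1)     # (batch, 1, frames, hidden)
        h = self.conv(h)                # (batch, 32, frames // 4, hidden)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        out, _ = self.blstm(h)          # (batch, frames // 4, 2 * hidden)
        return out

enc = SpeechEncoder()
feats = enc(torch.randn(2, 100, 80))    # -> (2, 25, 512)
```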
“…Recently, Transformer models [15] have shown impressive performance in many tasks, such as pretrained language models [16, 17], end-to-end speech recognition [18, 19], and speaker diarization [20], surpassing models based on long short-term memory recurrent neural networks (LSTM-RNNs). One of the key components of the Transformer model is self-attention, which computes the contribution of every position in the input sequence and maps the sequence to a vector at each time step.…”
Section: Introduction (mentioning)
confidence: 99%
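The self-attention operation that citation describes can be written in a few lines. This is a minimal single-head, scaled dot-product sketch: every output position is a weighted sum over the whole input sequence, yielding one context vector per time step. Dimensions and weight initialization are illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # (seq, d_k) each
    scores = q @ k.T / k.shape[-1] ** 0.5    # (seq, seq) pairwise similarities
    weights = F.softmax(scores, dim=-1)      # contribution of every position
    return weights @ v                       # (seq, d_k): one vector per step

seq_len, d_model, d_k = 10, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
ctx = self_attention(x, w_q, w_k, w_v)       # (10, 8)
```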