Katsuki Inoue scite author profile

This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-TTS, which is an extension of the open-source speech processing toolkit ESPnet. The toolkit supports state-of-the-art E2E-TTS models, including Tacotron 2, Transformer TTS, and FastSpeech, and also provides recipes inspired by the Kaldi automatic speech recognition (ASR) toolkit. The recipes are based on the design unified with the ESPnet ASR recipe, providing high reproducibility. The toolkit also provides pre-trained models and samples of all of the recipes so that users can use it as a baseline. Furthermore, the unified design enables the integration of ASR functions with TTS, e.g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models. This paper describes the design of the toolkit and experimental evaluation in comparison with other toolkits. The experimental results show that our best model outperforms other toolkits, resulting in a mean opinion score (MOS) of 4.25 on the LJSpeech dataset. The toolkit is available on GitHub.


An investigation to transplant emotional expressions in DNN-based TTS synthesis

Inoue, Hara, Abe, et al. 2017


ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

Hayashi, Yamamoto, Inoue, et al. 2019. Preprint.


Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models

Inoue, Hara, Abe, et al. 2020


Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

Inoue, Hara, Abe, et al. 2021. Speech Communication.


