ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053512
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

Abstract: This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-TTS, which is an extension of the open-source speech processing toolkit ESPnet. The toolkit supports state-of-the-art E2E-TTS models, including Tacotron 2, Transformer TTS, and FastSpeech, and also provides recipes inspired by the Kaldi automatic speech recognition (ASR) toolkit. The recipes follow a design unified with the ESPnet ASR recipes, providing high reproducibility. The toolkit also provides pre-trained models a…

Cited by 141 publications (88 citation statements)
References 30 publications (51 reference statements)
“…From Tables I and III, it was shown that without TTS-oriented pretraining, VTNs were less robust to training data reduction than RNNs in terms of objective measures but better in terms of subjective measures. This is possibly because a more complex model like VTN is capable of generating better-sounding voices while being more prone to overfitting, since it lacks attention regularizations such as location-sensitive attention, as suggested in [68]. When we applied TTS-oriented pretraining to both VTN and RNN, it could be clearly observed that VTNs outperformed RNNs in terms of all objective measures except F0RMSE, as well as in subjective scores.…”
Section: Comparison of RNN and Transformer Based Models
confidence: 98%
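
For context, the regularizer named in this excerpt is the location-sensitive attention of Chorowski et al. (2015), used in Tacotron 2-style RNN decoders. The following is a minimal PyTorch sketch of the idea; all class and parameter names are chosen for illustration, and it conditions only on the previous alignment, whereas Tacotron 2 additionally feeds the cumulative alignment.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    def __init__(self, query_dim, key_dim, att_dim=128, filters=32, kernel=31):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, att_dim, bias=False)
        self.key_proj = nn.Linear(key_dim, att_dim, bias=False)
        # Convolve the previous alignment so the energies "see" where
        # the decoder attended at the last step.
        self.loc_conv = nn.Conv1d(1, filters, kernel, padding=kernel // 2, bias=False)
        self.loc_proj = nn.Linear(filters, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, query, keys, prev_align):
        # query: (B, query_dim); keys: (B, T_in, key_dim); prev_align: (B, T_in)
        loc = self.loc_conv(prev_align.unsqueeze(1)).transpose(1, 2)  # (B, T_in, filters)
        energies = self.v(torch.tanh(
            self.query_proj(query).unsqueeze(1) + self.key_proj(keys) + self.loc_proj(loc)
        )).squeeze(-1)                                                # (B, T_in)
        align = F.softmax(energies, dim=-1)
        context = torch.bmm(align.unsqueeze(1), keys).squeeze(1)      # (B, key_dim)
        return context, align

Feeding the previous alignment into the energy computation biases the attention toward moving monotonically along the input, which is the stabilizing effect the excerpt credits for the RNNs' robustness and which plain multi-head Transformer attention lacks.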
“…In addition to the L1, L2, and weighted binary cross-entropy losses, the guided attention loss is also applied. As pointed out in [66], in Transformer-based speech synthesis not all attention heads exhibit diagonal alignments, so following [30], [68], the guided attention loss is applied only to a subset of heads in a subset of decoder layers.…”
Section: Transformer-based Model
confidence: 99%
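
To make the technique in this excerpt concrete, here is a minimal PyTorch sketch of the guided attention loss of Tachibana et al. (2018) applied to a selected subset of heads. The function name and the head-selection slice in the usage line are illustrative assumptions, not the exact implementation of the cited works.

import torch

def guided_attention_loss(att, ilens, olens, sigma=0.4):
    # att: (B, H, T_out, T_in) attention weights of the selected heads.
    # ilens / olens: valid input / output lengths per utterance.
    losses = []
    for b in range(att.size(0)):
        N, T = int(ilens[b]), int(olens[b])
        n = torch.arange(N, device=att.device, dtype=att.dtype) / N
        t = torch.arange(T, device=att.device, dtype=att.dtype) / T
        # Penalty W[t, n] is ~0 near the diagonal t/T ~ n/N and grows away
        # from it, so off-diagonal attention mass is penalized.
        W = 1.0 - torch.exp(-((t[:, None] - n[None, :]) ** 2) / (2 * sigma ** 2))
        losses.append((att[b, :, :T, :N] * W).mean())
    return torch.stack(losses).mean()

# Apply the loss only to, e.g., the first two heads of one decoder layer,
# leaving the remaining heads free to learn non-diagonal behaviour:
# loss = guided_attention_loss(att_weights[:, :2], ilens, olens)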
“…We build the experimental models by modifying the end-to-end text-to-speech toolkit called ESPnet-TTS [37], which provides unified and reproducible implementations of popular end-to-end TTS networks. The models to be compared are briefly described as follows, while more detailed hyperparameters can be found in the related references:…”
Section: A. Settings
confidence: 99%
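
For readers who want to try the toolkit this excerpt builds on, below is a hedged sketch of synthesis with a pre-trained ESPnet-TTS model through ESPnet2's Text2Speech interface. It assumes the espnet, espnet_model_zoo, and soundfile packages are installed; the model tag is a placeholder assumption, and any published ESPnet-TTS model tag can be substituted.

import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# Model tag is an illustrative assumption; weights download on first use.
tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_tacotron2")
out = tts("The quick brown fox jumps over the lazy dog.")
sf.write("out.wav", out["wav"].numpy(), tts.fs)  # tts.fs: model sampling rate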
“…As this approach gains traction in the TTS field (cf. the ESPnet-TTS project [9]), the continuing role of any separate front-end processing is naturally brought into question. Given its cost in time and money to develop, what value does it bring, and how can it be simplified and optimised for S2S TTS?…”
Section: Introduction
confidence: 99%