ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053512
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

Abstract: This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-TTS, which is an extension of the open-source speech processing toolkit ESPnet. The toolkit supports state-of-the-art E2E-TTS models, including Tacotron 2, Transformer TTS, and FastSpeech, and also provides recipes inspired by the Kaldi automatic speech recognition (ASR) toolkit. The recipes follow a design unified with the ESPnet ASR recipes, providing high reproducibility. The toolkit also provides pre-trained models a…

Cited by 141 publications (88 citation statements)
References 30 publications (51 reference statements)
“…From Tables I and III, it was shown that without TTS-oriented pretraining, VTNs were less robust to training data reduction than RNNs in terms of objective measures but better in terms of subjective measures. This is possibly because a more complex model like VTN is capable of generating better-sounding voices while being more prone to overfitting, since it lacks attention regularizations such as location-sensitive attention, as suggested in [68]. When we applied TTS-oriented pretraining to both VTN and RNN, it could be clearly observed that VTNs outperformed RNNs in terms of all objective measures except F0RMSE, as well as in subjective scores.…”
Section: Comparison of RNN and Transformer Based Models
confidence: 98%
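
For context, the regularizer named in this excerpt is the location-sensitive attention of Chorowski et al. (2015), used in Tacotron 2-style RNN decoders. The following is a minimal PyTorch sketch of the idea; all class and parameter names are chosen for illustration, and it conditions only on the previous alignment, whereas Tacotron 2 additionally feeds the cumulative alignment.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    def __init__(self, query_dim, key_dim, att_dim=128, filters=32, kernel=31):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, att_dim, bias=False)
        self.key_proj = nn.Linear(key_dim, att_dim, bias=False)
        # Convolve the previous alignment so the energies "see" where
        # the decoder attended at the last step.
        self.loc_conv = nn.Conv1d(1, filters, kernel, padding=kernel // 2, bias=False)
        self.loc_proj = nn.Linear(filters, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, query, keys, prev_align):
        # query: (B, query_dim); keys: (B, T_in, key_dim); prev_align: (B, T_in)
        loc = self.loc_conv(prev_align.unsqueeze(1)).transpose(1, 2)  # (B, T_in, filters)
        energies = self.v(torch.tanh(
            self.query_proj(query).unsqueeze(1) + self.key_proj(keys) + self.loc_proj(loc)
        )).squeeze(-1)                                                # (B, T_in)
        align = F.softmax(energies, dim=-1)
        context = torch.bmm(align.unsqueeze(1), keys).squeeze(1)      # (B, key_dim)
        return context, align

Feeding the previous alignment into the energy computation biases the attention toward moving monotonically along the input, which is the stabilizing effect the excerpt credits for the RNNs' robustness and which plain multi-head Transformer attention lacks.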
“…In addition to the L1, L2, and weighted binary cross-entropy losses, the guided attention loss is also applied. As pointed out in [66], in Transformer-based speech synthesis not all attention heads exhibit diagonal alignments, so following [30], [68], the guided attention loss is applied only to a subset of heads in a subset of decoder layers.…”
Section: Transformer-based Model
confidence: 99%
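
To make the technique in this excerpt concrete, here is a minimal PyTorch sketch of the guided attention loss of Tachibana et al. (2018) applied to a selected subset of heads. The function name and the head-selection slice in the usage line are illustrative assumptions, not the exact implementation of the cited works.

import torch

def guided_attention_loss(att, ilens, olens, sigma=0.4):
    # att: (B, H, T_out, T_in) attention weights of the selected heads.
    # ilens / olens: valid input / output lengths per utterance.
    losses = []
    for b in range(att.size(0)):
        N, T = int(ilens[b]), int(olens[b])
        n = torch.arange(N, device=att.device, dtype=att.dtype) / N
        t = torch.arange(T, device=att.device, dtype=att.dtype) / T
        # Penalty W[t, n] is ~0 near the diagonal t/T ~ n/N and grows away
        # from it, so off-diagonal attention mass is penalized.
        W = 1.0 - torch.exp(-((t[:, None] - n[None, :]) ** 2) / (2 * sigma ** 2))
        losses.append((att[b, :, :T, :N] * W).mean())
    return torch.stack(losses).mean()

# Apply the loss only to, e.g., the first two heads of one decoder layer,
# leaving the remaining heads free to learn non-diagonal behaviour:
# loss = guided_attention_loss(att_weights[:, :2], ilens, olens)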
“…We build the experimental models by modifying the end-to-end text-to-speech toolkit called ESPnet-TTS [37], which provides unified and reproducible implementations of popular end-to-end TTS networks. The models to be compared are briefly described as follows, while more detailed hyperparameters can be found in the related references:…”
Section: A. Settings
confidence: 99%
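
For readers who want to try the toolkit this excerpt builds on, below is a hedged sketch of synthesis with a pre-trained ESPnet-TTS model through ESPnet2's Text2Speech interface. It assumes the espnet, espnet_model_zoo, and soundfile packages are installed; the model tag is a placeholder assumption, and any published ESPnet-TTS model tag can be substituted.

import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# Model tag is an illustrative assumption; weights download on first use.
tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_tacotron2")
out = tts("The quick brown fox jumps over the lazy dog.")
sf.write("out.wav", out["wav"].numpy(), tts.fs)  # tts.fs: model sampling rate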
“…As this approach gains traction in the TTS field (cf. the ESPnet-TTS project [9]), the continuing role of any separate front-end processing is naturally brought into question. Given its cost in time and money to develop, what value does it bring, and how can it be simplified and optimised for S2S TTS?…”
Section: Introduction
confidence: 99%