2022
DOI: 10.48550/arxiv.2203.10473
Preprint

ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

Abstract: In recent years, neural-network-based methods for multi-speaker text-to-speech synthesis (TTS) have made significant progress. However, the speaker encoder models currently used in these methods still cannot capture enough speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that can generate high-quality speech with better speaker similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder…

Cited by 2 publications (3 citation statements)
References 22 publications
“…Speaker modeling on LibriTTS: Following prior work [26, 27, 28], a pre-trained speaker verification model ECAPA-TDNN [29] from SpeechBrain [30] is used as the speaker encoder to generate speaker embeddings. The Transformer-TTS is conditioned on the speaker embedding and Global Style Tokens (GST) [31] for multi-speaker modeling [27, 28]. We found that fine-tuning the TTS on each target speaker for 10 epochs significantly improves the synthesis quality.…”
(Footnotes in the citing work: 2 https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/unsupervised, 3 https://github.com/espnet/espnet/tree/master/egs2/)
Section: TTS Model
Mentioning confidence: 99%
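The citing work does not include code, but a minimal sketch of the embedding-extraction step it describes, using SpeechBrain's published ECAPA-TDNN speaker-verification checkpoint, might look like the following. The file name target_speaker.wav is hypothetical, and this illustrates standard SpeechBrain usage rather than the cited paper's exact pipeline; the 192-dimensional output is specific to this pretrained checkpoint.

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in SpeechBrain >= 1.0

# Load the pre-trained ECAPA-TDNN speaker-verification model published by SpeechBrain.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# "target_speaker.wav" is a hypothetical reference utterance; the model expects 16 kHz mono audio.
signal, sr = torchaudio.load("target_speaker.wav")
signal = signal.mean(dim=0, keepdim=True)  # down-mix to mono if needed
if sr != 16000:
    signal = torchaudio.functional.resample(signal, sr, 16000)

# encode_batch returns a (batch, 1, 192) tensor for this checkpoint; squeeze to a 192-d embedding
# that a TTS model can be conditioned on (e.g. broadcast and concatenated with encoder states).
embedding = encoder.encode_batch(signal).squeeze()
print(embedding.shape)  # torch.Size([192])
```

In practice, embeddings from several reference utterances of the same speaker are usually averaged to obtain a more stable speaker representation before conditioning the TTS model.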
“…The main objective of developing TTS systems with this method is to capture the speaker characteristics by extracting text-independent speaker embeddings from the target speaker's voice. Commonly used speaker encoders in TTS systems include d-vector and x-vector [26]. In order for these methods to be adapted to Turkish speech synthesis systems, a large-sized speech-text corpus or pre-trained acoustic models are needed [26].…”
Section: Introduction
Mentioning confidence: 99%
“…Commonly used speaker encoders in TTS systems include d-vector and x-vector [26]. In order for these methods to be adapted to Turkish speech synthesis systems, a large-sized speech-text corpus or pre-trained acoustic models are needed [26]. Currently, the lack of an available Turkish corpus or pre-trained acoustic model makes it difficult to work on Turkish in the field of TTS.…”
Section: Introduction
Mentioning confidence: 99%