2022
DOI: 10.48550/arxiv.2203.10473
Preprint

ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

Abstract: In recent years, neural-network-based methods for multi-speaker text-to-speech synthesis (TTS) have made significant progress. However, the speaker encoder models currently used in these methods still cannot capture enough speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that can generate high-quality speech with better speaker similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder…

Cited by 2 publications (3 citation statements)
References 22 publications
“…Speaker modeling on LibriTTS: Following prior work [26, 27, 28], a pre-trained speaker verification model ECAPA-TDNN [29] from SpeechBrain [30] is used as the speaker encoder to generate speaker embeddings. The Transformer-TTS is conditioned on the speaker embedding and Global Style Tokens (GST) [31] for multi-speaker modeling [27, 28]. We found that fine-tuning the TTS on each target speaker for 10 epochs significantly improves the synthesis quality.…”
(Footnotes in the citing work: 2 https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/unsupervised, 3 https://github.com/espnet/espnet/tree/master/egs2/)
Section: TTS Model
Mentioning confidence: 99%
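The citing work does not include code, but a minimal sketch of the embedding-extraction step it describes, using SpeechBrain's published ECAPA-TDNN speaker-verification checkpoint, might look like the following. The file name target_speaker.wav is hypothetical, and this illustrates standard SpeechBrain usage rather than the cited paper's exact pipeline; the 192-dimensional output is specific to this pretrained checkpoint.

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in SpeechBrain >= 1.0

# Load the pre-trained ECAPA-TDNN speaker-verification model published by SpeechBrain.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# "target_speaker.wav" is a hypothetical reference utterance; the model expects 16 kHz mono audio.
signal, sr = torchaudio.load("target_speaker.wav")
signal = signal.mean(dim=0, keepdim=True)  # down-mix to mono if needed
if sr != 16000:
    signal = torchaudio.functional.resample(signal, sr, 16000)

# encode_batch returns a (batch, 1, 192) tensor for this checkpoint; squeeze to a 192-d embedding
# that a TTS model can be conditioned on (e.g. broadcast and concatenated with encoder states).
embedding = encoder.encode_batch(signal).squeeze()
print(embedding.shape)  # torch.Size([192])
```

In practice, embeddings from several reference utterances of the same speaker are usually averaged to obtain a more stable speaker representation before conditioning the TTS model.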
“…The main objective of developing TTS systems with this method is to capture the speaker characteristics by extracting text-independent speaker embeddings from the target speaker's voice. Commonly used speaker encoders in TTS systems include d-vector and x-vector [26]. In order for these methods to be adapted to Turkish speech synthesis systems, a large-sized speech-text corpus or pre-trained acoustic models are needed [26].…”
Section: Introduction
Mentioning confidence: 99%
“…Commonly used speaker encoders in TTS systems include d-vector and x-vector [26]. In order for these methods to be adapted to Turkish speech synthesis systems, a large-sized speech-text corpus or pre-trained acoustic models are needed [26]. Currently, the lack of an available Turkish corpus or pre-trained acoustic model makes it difficult to work on Turkish in the field of TTS.…”
Section: Introduction
Mentioning confidence: 99%