Interspeech 2020
DOI: 10.21437/interspeech.2020-1032

From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint

Abstract: In recent years, end-to-end text-to-speech models have become able to synthesize high-fidelity speech. However, accessing and controlling speech attributes such as speaker identity, prosody, and emotion in a text-to-speech system remains a challenge. This paper presents a system involving feedback constraints for multispeaker speech synthesis. We enhance the knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network. The constraint is taken by an added …
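The truncated abstract describes the paper's core idea: a speaker verification network is kept in the loop during TTS training so that the synthesized speech is pushed toward the target speaker's embedding. As a rough illustration only (the exact formulation is behind the truncation), here is a minimal PyTorch-style sketch; all names, including speaker_encoder, are hypothetical:

```python
import torch
import torch.nn.functional as F

def feedback_constraint_loss(mel_pred, mel_ref, speaker_encoder):
    # Embed the reference utterance with the frozen speaker verification
    # network; no gradient is needed on the target side.
    with torch.no_grad():
        emb_ref = speaker_encoder(mel_ref)
    # Embed the synthesized mel with gradients enabled, so the speaker
    # loss back-propagates into the TTS model being trained.
    emb_pred = speaker_encoder(mel_pred)
    emb_ref = F.normalize(emb_ref, dim=-1)
    emb_pred = F.normalize(emb_pred, dim=-1)
    # Cosine distance between the two speaker embeddings.
    return 1.0 - (emb_ref * emb_pred).sum(dim=-1).mean()
```

In such a setup the feedback term would typically be added to the usual reconstruction loss with a weighting factor, e.g. total_loss = recon_loss + lambda_fb * feedback_constraint_loss(...).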

Cited by 29 publications (21 citation statements) | References 18 publications
“…As the classifier extracts high-level features that are relevant for phone recognition, this loss term supervises the training of WaveNet to attend to temporal dynamics and to penalize bad pronunciations. Cai et al. [77] use a pre-trained speaker embedding network to provide a feedback constraint, which serves as the perceptual loss for the training of a multispeaker TTS system.…”
Section: Perceptual Loss for Style Reconstruction
confidence: 99%
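The perceptual loss this excerpt describes compares high-level features from a frozen, pre-trained classifier instead of raw outputs. A minimal sketch under that assumption (feature_extractor is a hypothetical name for the frozen classifier's feature head):

```python
import torch
import torch.nn.functional as F

def perceptual_loss(wav_pred, wav_ref, feature_extractor):
    # High-level features of the ground-truth waveform, frozen.
    with torch.no_grad():
        feat_ref = feature_extractor(wav_ref)
    # Features of the synthesized waveform, with gradients flowing
    # back into the model being trained.
    feat_pred = feature_extractor(wav_pred)
    # Matching classifier features penalizes poor temporal dynamics and
    # mispronunciations more directly than a sample-level loss would.
    return F.l1_loss(feat_pred, feat_ref)
```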
“…The zero-shot speaker adaptation approach has been used by recent studies to synthesize speech from unseen target speakers without retraining the model for those speakers [20,21,22,23]. It utilizes a speaker encoder that has been trained separately for speaker recognition tasks such as automatic speaker identification (ASI) and automatic speaker verification (ASV).…”
Section: Introduction
confidence: 99%
“…It is widely used to model speaker representation. This includes zero-shot speaker adaptation in Tacotron-based TTS that uses a DNN-based x-vector embedding [20,22], d-vector embedding [21], and ResNet34 speaker embedding [23].…”
Section: Introduction
confidence: 99%
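The zero-shot recipe these two excerpts describe reduces to one inference pass through the separately trained verification network, followed by conditioning the synthesizer on the resulting embedding. A hedged sketch of both steps (verification_net, enc_out, and the concatenation point are assumptions, not the cited papers' exact designs):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_speaker_embedding(ref_mel, verification_net):
    # One forward pass through the pre-trained verification network
    # yields a fixed-dimensional speaker embedding (d-vector/x-vector
    # style); the TTS model is never retrained for the new speaker.
    emb = verification_net(ref_mel)            # (batch, emb_dim)
    return F.normalize(emb, dim=-1)

def condition_encoder_outputs(enc_out, spk_emb):
    # Broadcast the utterance-level embedding over the text time axis
    # and concatenate it to the Tacotron-style encoder outputs.
    spk = spk_emb.unsqueeze(1).expand(-1, enc_out.size(1), -1)
    return torch.cat([enc_out, spk], dim=-1)
```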
“…Previous research on multi-speaker TTS typically uses a speaker representation to control the synthesized utterance's speaker identity. This speaker representation can be jointly learned with the TTS model in the form of an embedding table [4,5,6,7] or a speaker encoder [5,6,8], or can be transferred from another pre-trained model for speaker information extraction [9,10]. To control the style of synthesized speech, global style token (GST)…”
Section: Introduction
confidence: 99%
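This last excerpt contrasts two sources for the speaker representation: a lookup table learned jointly with the TTS model, or an embedding transferred from a pre-trained external model. A small illustrative sketch of that choice (class and argument names are hypothetical):

```python
import torch
import torch.nn as nn

class SpeakerRepresentation(nn.Module):
    # (a) a jointly learned lookup table over training-speaker IDs, or
    # (b) an embedding transferred from a pre-trained external model.
    def __init__(self, num_speakers, emb_dim):
        super().__init__()
        self.table = nn.Embedding(num_speakers, emb_dim)  # option (a)

    def forward(self, speaker_id=None, transferred_emb=None):
        if transferred_emb is not None:    # option (b): pre-trained model
            return transferred_emb
        return self.table(speaker_id)      # option (a): embedding table
```

Option (a) only covers speakers seen in training, which is why the zero-shot approaches above transfer the embedding from a verification model instead.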