From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint
Preprint, 2020
DOI: 10.48550/arxiv.2005.04587

Cited by 8 publications (11 citation statements: 0 supporting, 11 mentioning, 0 contrasting; citing publications from 2020-2021) | References 16 publications
“…In a typical multi-speaker TTS system [54], speaker information can be provided by a speaker embedding extracted by a speaker encoder trained on a speaker verification task with speech as input. Here, we also trained a speech-based speaker encoder on the speaker verification task with the large margin softmax loss [55].…”
Section: B. Face Encoder (mentioning)
confidence: 99%
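The "large margin softmax loss" cited as [55] is a family of classification objectives; the additive-margin (AM-Softmax) form below is a minimal, hypothetical sketch of how such a head could be attached to a speaker encoder during verification training. Names and hyperparameters are illustrative, not taken from the cited papers.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxHead(nn.Module):
    """Additive-margin softmax over speaker classes (one large-margin variant)."""
    def __init__(self, embed_dim, num_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.margin = margin
        self.scale = scale

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the margin from the target-class cosine only, then rescale.
        onehot = F.one_hot(labels, cosine.size(1)).to(cosine.dtype)
        logits = self.scale * (cosine - self.margin * onehot)
        return F.cross_entropy(logits, labels)

After training, the classification head is discarded and the encoder's output is used directly as the speaker embedding.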
“…As mentioned in the last subsection, because a speech-face paired database is not available, we cannot train a face-conditioned multi-speaker TTS directly. Instead, we train a multi-speaker TTS model on a text-speech paired database, as in general multi-speaker TTS methods [54]. Specifically, during training, speaker information can be provided by the speaker embedding extracted with a speaker encoder.…”
Section: Face-Conditioned Multi-Speaker TTS (mentioning)
confidence: 99%
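As a sketch of the conditioning step the quote describes, one common way an embedding-conditioned multi-speaker TTS injects speaker information is to broadcast the utterance-level embedding over time and concatenate it with the text-encoder states before decoding. This is a generic pattern under assumed tensor shapes, not the cited system's exact architecture.

import torch
import torch.nn as nn

class SpeakerConditioner(nn.Module):
    """Concatenate a speaker embedding onto every text-encoder frame."""
    def __init__(self, text_dim, spk_dim):
        super().__init__()
        # Project back to the decoder's expected input size.
        self.proj = nn.Linear(text_dim + spk_dim, text_dim)

    def forward(self, text_states, spk_embed):
        # text_states: (batch, time, text_dim); spk_embed: (batch, spk_dim)
        spk = spk_embed.unsqueeze(1).expand(-1, text_states.size(1), -1)
        return self.proj(torch.cat([text_states, spk], dim=-1))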
“…We also apply the loss proposed by [16] to the output of the PostNet as the speaker reconstruction loss. Denoting the speaker embedding extracted from the target Mel-spectrogram as s and the one extracted from the predicted Mel-spectrogram as ŝ, the speaker reconstruction loss is…”
Section: Supervision on Predicted Mel-Spectrogram (mentioning)
confidence: 99%
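The quoted equation is cut off by the citation viewer. A common instantiation of this kind of speaker reconstruction loss is one minus the cosine similarity between the reference embedding s and the embedding ŝ re-extracted from the predicted Mel-spectrogram; the sketch below assumes that form rather than reproducing the paper's exact formula.

import torch
import torch.nn.functional as F

def speaker_reconstruction_loss(s, s_hat):
    # s, s_hat: (batch, embed_dim) embeddings from target and predicted Mels.
    # 0 when the embeddings point the same way, up to 2 when opposite.
    return (1.0 - F.cosine_similarity(s, s_hat, dim=-1)).mean()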
“…Then we erase the speaker information from the intermediate representation by incorporating the speaker encoder. Our proposed system also applies the feedback loss control for the speaker embedding, which was proposed in [16] and applied to voice conversion (VC) in [17]. By adding this loss, the generated results of our proposed system demonstrate a high spoofing capability against speaker verification systems.…”
Section: Introduction (mentioning)
confidence: 98%
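To illustrate the feedback loss control described above, the hypothetical training step below feeds the predicted Mel-spectrogram back through a speaker encoder and pulls the resulting embedding toward the reference one, alongside the usual reconstruction term. tts_model, speaker_encoder, and lambda_fb are assumed names; the weighting and loss choices are not from the cited papers.

import torch
import torch.nn.functional as F

def training_step(tts_model, speaker_encoder, text, ref_mel, lambda_fb=1.0):
    with torch.no_grad():
        s = speaker_encoder(ref_mel)           # reference embedding (no grad)
    pred_mel = tts_model(text, s)              # embedding-conditioned synthesis
    mel_loss = F.l1_loss(pred_mel, ref_mel)    # standard reconstruction term
    s_hat = speaker_encoder(pred_mel)          # feedback pass on generated Mel
    fb_loss = (1.0 - F.cosine_similarity(s, s_hat, dim=-1)).mean()
    return mel_loss + lambda_fb * fb_loss

Gradients flow through s_hat back into the synthesizer; keeping the speaker encoder's parameters out of the optimizer ensures only the TTS model is updated by the feedback term.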