2022
DOI: 10.48550/arxiv.2206.02246
Preprint

Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models

Abstract: We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training. The method requires a short (∼3 seconds) sample from the target person, and generation is steered at inference time, without any training steps. At the heart of the method lies a sampling process that combines the estimation of the denoising model with a low-pass version of the new speaker's sample. The objective and subjective evaluations show that our …
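The sampling process the abstract describes can be sketched in a few lines. What follows is a minimal, hypothetical PyTorch reading of an ILVR-style update on the model's latents (assumed shape (batch, channels, time)); `model.p_sample`, `model.q_sample`, the pooling-based `low_pass` filter, and all parameter names are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of inference-time voice conditioning (ILVR-style).
import torch
import torch.nn.functional as F

def low_pass(x: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Crude low-pass filter: downsample then upsample along time.
    x has shape (batch, channels, time); the paper's filter may differ."""
    down = F.avg_pool1d(x, kernel_size=factor, stride=factor)
    return F.interpolate(down, size=x.shape[-1], mode="linear", align_corners=False)

@torch.no_grad()
def voice_conditioned_sampling(model, text_cond, ref_sample, num_steps=100):
    """Reverse diffusion in which each latent's low frequencies are
    replaced by those of the (noised) ~3 s reference sample."""
    x = torch.randn_like(ref_sample)            # start from pure noise
    for t in reversed(range(num_steps)):
        x = model.p_sample(x, t, text_cond)     # text-conditioned denoise step (assumed helper)
        ref_t = model.q_sample(ref_sample, t)   # diffuse reference to level t (assumed helper)
        x = x - low_pass(x) + low_pass(ref_t)   # adopt the reference's low frequencies
    return x
```

The design point is that conditioning happens entirely at inference: the pretrained model is untouched, and only the low-frequency content of each intermediate latent is swapped for that of the noised reference sample.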

Cited by 2 publications (3 citation statements) · References 13 publications
“…The development of text-to-speech has undergone a shift from a three-stage framework to a two-stage framework, as shown in Figure 1. Before applying neural networks, … [table of diffusion TTS models spilled into the quote: Diff-TTS [31] and Grad-TTS [85] (Code/Project); efficient acoustic models ProDiff [28] (Project) and DiffGAN-TTS [60] (Code); adaptive multi-speaker models Grad-TTS with ILVR [52] (Project), Grad-StyleSpeech [32] (Project), Guided-TTS [37], and Guided-TTS 2 [38] (Project); with discrete latent space, Diffsound [128] (Project) and NoreSpeech [127]; fine-grained control, EmoDiff [22] (Project); vocoder …]”
Section: Text-to-speech Synthesis
Citation type: mentioning (confidence: 99%)
“…However, they can be trained only when transcribed data of the target speaker is provided. By applying iterative latent variable refinement (ILVR) [11] to Grad-TTS [85], Grad-TTS with ILVR [52] proposes to mix the latent variable with the reference voice of a target speaker during inference, enabling zero-shot speaker adaptation without any training. Also following Grad-TTS [85], another work, Grad-StyleSpeech [32], encodes the mel-spectrogram of the reference speech into a style vector, which is involved in the training of the diffusion model.…”
Section: Adaptive Modeling For Multi-speaker Setting
Citation type: mentioning (confidence: 99%)
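Concretely, the inference-time mixing this quote describes can be written as the ILVR update (notation assumed here, not taken from either paper):

$$x_{t-1} \leftarrow x'_{t-1} - \phi_N(x'_{t-1}) + \phi_N(y_{t-1}),$$

where $x'_{t-1}$ is the latent proposed by the model's ordinary reverse diffusion step, $y_{t-1}$ is the target speaker's reference sample diffused to noise level $t-1$, and $\phi_N$ is a low-pass filter. No gradient step or fine-tuning is involved, which is why the quoted survey describes the adaptation as zero-shot.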
“…Denoising diffusion models [60,64] have seen great success on a wide variety of challenges, ranging from image-to-image translation tasks such as inpainting, colorisation, image upscaling, and uncropping [6,26,41,42,50,53,57,59], to audio generation [11,28,33,35,38,48,67,80], text-based image generation [4,21,23,46,51,55,58], video generation [24,27,82,86], and many others. For a thorough review of diffusion models and all of their recent applications, we recommend [81].…”
Section: Diffusion Models
Citation type: mentioning (confidence: 99%)