2022
DOI: 10.48550/arxiv.2201.11972
Preprint

DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

Abstract: Denoising diffusion probabilistic models (DDPMs) are expressive generative models that have been used to solve a variety of speech synthesis problems. However, because of their high sampling costs, DDPMs are difficult to use in real-time speech processing applications. In this paper, we introduce DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model achieving high-fidelity and efficient speech synthesis. DiffGAN-TTS is based on denoising diffusion generative adversarial networks (GANs), which adopt an adv…
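
The abstract's core idea, replacing hundreds of Gaussian denoising steps with a few adversarially modelled ones, can be illustrated with a minimal sketch. The snippet below is an illustrative assumption rather than the authors' released code: a generator predicts the clean mel-spectrogram from the noisy input, and the standard DDPM posterior is used to step backwards over only a handful of steps; `generator`, `text_cond`, the noise schedule, and T = 4 are all placeholders.

```python
# Minimal sketch (not the authors' code) of few-step sampling with a
# denoising-diffusion-GAN-style denoiser: the generator predicts x0 from x_t,
# and the usual DDPM posterior q(x_{t-1} | x_t, x0) provides the backward step.
import torch

T = 4                                              # assumed small number of steps
betas = torch.linspace(1e-4, 0.5, T)               # assumed noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(generator, text_cond, shape):
    """Few-step reverse diffusion with a GAN-based denoiser."""
    x_t = torch.randn(shape)                       # start from pure noise
    for t in reversed(range(T)):
        t_idx = torch.full((shape[0],), t, dtype=torch.long)
        x0_pred = generator(x_t, t_idx, text_cond) # adversarially trained denoiser
        if t == 0:
            return x0_pred
        # DDPM posterior mean/variance of q(x_{t-1} | x_t, x0_pred)
        ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
        coef_x0 = betas[t] * torch.sqrt(ab_prev) / (1.0 - ab_t)
        coef_xt = (1.0 - ab_prev) * torch.sqrt(alphas[t]) / (1.0 - ab_t)
        var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
        mean = coef_x0 * x0_pred + coef_xt * x_t
        x_t = mean + torch.sqrt(var) * torch.randn_like(x_t)
    return x_t
```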

Cited by 6 publications (16 citation statements)
References 16 publications (26 reference statements)
“…The development of text-to-speech has undergone a shift from a three-stage framework to a two-stage framework, as shown in Figure 1. Before applying neural networks, … [flattened table residue; the recoverable taxonomy lists diffusion acoustic models Diff-TTS [31] and Grad-TTS [85] (Code / Project); efficient acoustic models ProDiff [28] (Project) and DiffGAN-TTS [60] (Code); adaptive multi-speaker models Grad-TTS with ILVR [52] (Project), Grad-StyleSpeech [32] (Project), Guided-TTS [37], and Guided-TTS 2 [38] (Project); discrete-latent-space models Diffsound [128] (Project) and NoreSpeech [127]; fine-grained control with EmoDiff [22] (Project); and vocoders]…”
Section: Text-to-speech Synthesis (mentioning)
confidence: 99%
“…Prior work [126] attributes the thousands of denoising steps in diffusion models to the fact that they commonly approximate the denoising distribution with a Gaussian, which requires small step sizes. Inspired by [126], which adopts a GAN to model the denoising distribution and thereby enables larger step sizes and fewer steps, DiffGAN-TTS [60] applies a pretrained GAN as the acoustic generator for acceleration. To speed up inference further, DiffGAN-TTS [60] also introduces an active shallow diffusion mechanism that conducts denoising conditioned on the coarse prediction of the pretrained GAN.…”
Section: Towards Efficient Acoustic Model (mentioning)
confidence: 99%
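
As a hedged illustration of the active shallow diffusion mechanism mentioned in this statement, the sketch below assumes a pretrained basic acoustic model that produces a coarse mel-spectrogram, which is forward-diffused only to a shallow step and then refined by the GAN-based denoiser while being conditioned on that coarse prediction. `basic_model`, `denoiser`, and `t_shallow` are hypothetical names, and the posterior sampling step shown in the earlier sketch is abstracted away for brevity.

```python
# Hedged sketch of shallow-diffusion inference: stage 1 gives a coarse mel,
# stage 2 refines it with a few GAN-based denoising steps conditioned on it.
import torch

def shallow_diffusion_infer(basic_model, denoiser, text_cond, alpha_bar, t_shallow=2):
    coarse_mel = basic_model(text_cond)            # stage 1: coarse prediction
    noise = torch.randn_like(coarse_mel)
    ab = alpha_bar[t_shallow]
    # forward-diffuse the coarse mel only up to the shallow step t_shallow
    x_t = torch.sqrt(ab) * coarse_mel + torch.sqrt(1.0 - ab) * noise
    # stage 2: a few denoising steps, each conditioned on the coarse prediction
    # (the DDPM posterior step is folded into `denoiser` here for brevity)
    for t in reversed(range(t_shallow + 1)):
        t_idx = torch.full((coarse_mel.shape[0],), t, dtype=torch.long)
        x_t = denoiser(x_t, t_idx, text_cond, coarse_mel)
    return x_t
```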
“…Discriminator: Multi-scale discriminators (MSD) [17] and joint conditional-unconditional discriminators (JCUD) [18] have proven to be among the most efficient models for audio synthesis tasks. Inspired by them, we propose a joint conditional-unconditional multi-scale discriminator (JCU-MSD), i.e., D_φ, which is shown in Fig.…”
Section: Feature Reconstruction Network (mentioning)
confidence: 99%
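
The JCU-MSD described in this statement combines two known ideas. Below is a minimal, assumption-laden sketch of a single-scale joint conditional/unconditional (JCU) discriminator in PyTorch: shared convolutions feed an unconditional head that scores the mel-spectrogram alone and a conditional head that also sees a conditioning feature (e.g., a speaker or text embedding); a multi-scale variant would apply copies of this module to progressively downsampled inputs. Layer sizes and names are illustrative, not the cited paper's implementation.

```python
# Illustrative JCU discriminator: shared trunk, two output heads.
import torch
import torch.nn as nn

class JCUDiscriminator(nn.Module):
    def __init__(self, mel_dim=80, cond_dim=256, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv1d(mel_dim, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
        )
        self.uncond_head = nn.Conv1d(hidden, 1, kernel_size=3, padding=1)
        self.cond_proj = nn.Conv1d(cond_dim, hidden, kernel_size=1)
        self.cond_head = nn.Conv1d(2 * hidden, 1, kernel_size=3, padding=1)

    def forward(self, mel, cond):
        # mel: (batch, mel_dim, frames); cond: (batch, cond_dim, frames)
        h = self.shared(mel)
        uncond_score = self.uncond_head(h)          # unconditional realism score
        c = self.cond_proj(cond)
        cond_score = self.cond_head(torch.cat([h, c], dim=1))  # conditional score
        return uncond_score, cond_score
```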