GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis

Yang, Jinhyeok; Bae, Jae‐sung; Bak, Taejun; Kim, Young-Ik; Cho, Hoon Young

doi:10.21437/interspeech.2021-971

Cited by 17 publications

(11 citation statements)

References 0 publications

“…(9) and $\lambda_{fm}$ is a dynamically scaled scalar computed as $\lambda_{fm} = L_{recon}/L_{fm}$ following (Yang et al, 2021). Detailed training procedure as well as inference procedure is presented in Appendix B.…”

Section: Training Loss (mentioning)

confidence: 99%
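
The dynamic scaling described in the statement above can be illustrated with a small sketch. The function names, the `eps` guard, and the way the three terms are summed into one generator loss are assumptions made for illustration, not details confirmed by the excerpt. In an autograd framework the weight would typically be treated as a constant (no gradient flows through it); that is also an assumption here.

```python
def dynamic_fm_weight(l_recon: float, l_fm: float, eps: float = 1e-8) -> float:
    """Return lambda_fm = L_recon / L_fm.

    eps is a hypothetical guard against division by zero; it is not
    mentioned in the quoted text.
    """
    return l_recon / (l_fm + eps)


def generator_loss(l_adv: float, l_recon: float, l_fm: float) -> float:
    """Combine the loss terms (assumed form: L_adv + L_recon + lambda_fm * L_fm)."""
    lam = dynamic_fm_weight(l_recon, l_fm)
    # With this scaling, the weighted feature-matching term has roughly the
    # same magnitude as the reconstruction term at every training step.
    return l_adv + l_recon + lam * l_fm


if __name__ == "__main__":
    # Illustrative values only, not results from any paper.
    print(dynamic_fm_weight(2.0, 4.0))
    print(generator_loss(0.3, 2.0, 4.0))
```

The practical effect of this choice is that the feature-matching term cannot dominate or vanish relative to the reconstruction term as their raw scales drift during training.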

“…The first counterpart is the representative non-AR TTS model FastSpeech 2 (Ren et al, 2021a). The second model is the GANSpeech model introduced in (Yang et al, 2021). The third model is the DiffSpeech model presented in (Liu et al, 2021a).…”

Section: Experimental Setup For Comparison (mentioning)

confidence: 99%

“…where we adopt the least-squares GAN (LS-GAN) training formulation (Mao et al, 2017) to minimize $D_{adv}$ because of its various successful practices in audio generation domain (Kumar et al, 2019; Kong et al, 2020; Yang et al, 2021; Kim et al, 2021).…”

Section: Acoustic Generator and Discriminator (mentioning)

confidence: 99%
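
For readers unfamiliar with the LS-GAN formulation named in the statement above, here is a minimal sketch of the least-squares objectives with targets 1 for real and 0 for generated samples. The function names are hypothetical, the discriminator outputs are plain lists of floats, and the 1/2 factors from Mao et al. (2017) are omitted, as is common in audio GAN code; none of this is taken from the citing paper's implementation.

```python
from statistics import fmean


def lsgan_discriminator_loss(real_scores: list[float], fake_scores: list[float]) -> float:
    """Least-squares discriminator loss: push D(real) to 1 and D(fake) to 0."""
    real_term = fmean([(s - 1.0) ** 2 for s in real_scores])
    fake_term = fmean([s ** 2 for s in fake_scores])
    return real_term + fake_term


def lsgan_generator_loss(fake_scores: list[float]) -> float:
    """Least-squares generator loss: push D(fake) to 1."""
    return fmean([(s - 1.0) ** 2 for s in fake_scores])


if __name__ == "__main__":
    # Illustrative discriminator outputs only.
    print(lsgan_discriminator_loss([0.9, 1.1], [0.2, -0.1]))
    print(lsgan_generator_loss([0.2, -0.1]))
```

Compared with the sigmoid cross-entropy GAN loss, the squared error keeps penalizing samples that are classified correctly but lie far from the target value, which is the usual argument for its more stable gradients.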

“…They either leverage an external text-to-acoustic alignment module (Ren et al, 2019; 2021a; Peng et al, 2020; Elias et al, 2021) or jointly train one within the TTS model (Zeng et al, 2020; Miao et al, 2021; Badlani et al, 2021). Other generative models have also been studied for TTS, such as Flow-based models (Miao et al, 2020), variational autoencoder (VAE)-based models (Lee et al, 2021; Liu et al, 2021b), and generative adversarial network (GAN)-based models (Donahue et al, 2021; Yang et al, 2021). TTS models combining different generative modeling techniques are also investigated, such as Flow with VAE (Ren et al, 2021b), Flow with VAE and GAN (Kim et al, 2021).…”

Section: Introduction (mentioning)

confidence: 99%

DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

Liu, Su, Yu (2022). Preprint

Denoising diffusion probabilistic models (DDPMs) are expressive generative models that have been used to solve a variety of speech synthesis problems. However, because of their high sampling costs, DDPMs are difficult to use in real-time speech processing applications. In this paper, we introduce DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model achieving high-fidelity and efficient speech synthesis. DiffGAN-TTS is based on denoising diffusion generative adversarial networks (GANs), which adopt an adversarially-trained expressive model to approximate the denoising distribution. We show with multi-speaker TTS experiments that DiffGAN-TTS can generate high-fidelity speech samples within only 4 denoising steps. We present an active shallow diffusion mechanism to further speed up inference. A two-stage training scheme is proposed, with a basic TTS acoustic model trained at stage one providing valuable prior information for a DDPM trained at stage two. Our experiments show that DiffGAN-TTS can achieve high synthesis performance with only 1 denoising step.

“…Adversarial loss $L_a$ is used to fool the discriminator by making $C_f$ and $F_f$ close to 1. Feature matching loss $L_f$ is an effective loss function to improve stability and quality of adversarial training [16,23]…”

Section: Training Algorithm (mentioning)

confidence: 99%
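
The feature matching loss mentioned in the statement above is commonly computed as a distance between the discriminator's intermediate activations for real and generated inputs. The sketch below assumes an L1 distance averaged within each layer and then across layers; the function name, the flat list-of-lists representation, and the choice of L1 are illustrative assumptions, not details given in the excerpt.

```python
from statistics import fmean


def feature_matching_loss(
    real_feats: list[list[float]], fake_feats: list[list[float]]
) -> float:
    """Mean absolute difference between per-layer discriminator features.

    Each inner list holds the flattened activations of one discriminator
    layer; real_feats[i] and fake_feats[i] must have the same length.
    """
    if len(real_feats) != len(fake_feats):
        raise ValueError("real and fake must have the same number of layers")
    per_layer = []
    for real_layer, fake_layer in zip(real_feats, fake_feats):
        if len(real_layer) != len(fake_layer):
            raise ValueError("layer activations must have matching sizes")
        per_layer.append(fmean([abs(r - f) for r, f in zip(real_layer, fake_layer)]))
    return fmean(per_layer)


if __name__ == "__main__":
    # Two hypothetical layers with illustrative activations.
    real = [[1.0, 2.0], [0.0]]
    fake = [[1.0, 4.0], [1.0]]
    print(feature_matching_loss(real, fake))
```

Because the target is the discriminator's own response to real data rather than a fixed label, this term gives the generator a training signal even when the adversarial loss itself is saturated.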

A Multi-Scale Time-Frequency Spectrogram Discriminator for GAN-based Non-Autoregressive TTS

Guo, Liu, Wu, et al. (2022). Preprint

The generative adversarial network (GAN) has shown its outstanding capability in improving Non-Autoregressive TTS (NAR-TTS) by adversarially training it with an extra model that discriminates between the real and the generated speech. To maximize the benefits of GAN, it is crucial to find a powerful discriminator that can capture rich distinguishable information. In this paper, we propose a multi-scale time-frequency spectrogram discriminator to help NAR-TTS generate high-fidelity Mel-spectrograms. It treats the spectrogram as a 2D image to exploit the correlation among different components in the time-frequency domain. And a U-Net-based model structure is employed to discriminate at different scales to capture both coarse-grained and fine-grained information. We conduct subjective tests to evaluate the proposed approach. Both multi-scale and time-frequency discriminating bring significant improvement in the naturalness and fidelity. When combining the neural vocoder, it is shown more effective and concise than fine-tuning the vocoder. Finally, we visualize the discriminating maps to compare their difference to verify the effectiveness of multiscale discriminating.

MixGAN-TTS: Efficient and Stable Speech Synthesis Based on Diffusion Model

Deng, Qiu, et al. (2023). IEEE Access

This paper describes MixGAN-TTS, an efficient and stable non-autoregressive speech synthesis based on diffusion model. The MixGAN-TTS uses a linguistic encoder based on soft phoneme-level alignment and hard word-level alignment approach which explicitly extracts word-level semantic information, and introduces pitch and energy predictors to optimally predict the rhythmic information of the audio. Specifically, we use the GAN to replace the Gaussian function to model the denoising distribution, aiming to enlarge the denoising steps size and reduce the number of denoising steps to accelerate the sampling speed of diffusion model. Diffusion model using GAN can significantly reduce the denoising steps, and to some extent solve the problem of not being able to apply in real-time. The mel-spectrogram is converted into the final audio by the HiFi-GAN vocoder. Experimental results show that the MixGAN-TTS outperforms the other models compared in terms of audio quality and mel-spectrogram modeling capability for 4 denoising steps. The ablation studies demonstrate that the structure of MixGAN-TTS is effective.

Index Terms: Speech synthesis, diffusion model, mixture attention mechanism, deep learning.
