2021
DOI: 10.48550/arxiv.2106.15153
Preprint

GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis

Abstract: Recent advances in neural multi-speaker text-to-speech (TTS) models have enabled the generation of reasonably good speech quality with a single model and made it possible to synthesize speech for a speaker with limited training data. Fine-tuning the multi-speaker model to the target speaker's data can achieve better quality; however, a gap still exists compared to real speech samples, and the resulting model is speaker-dependent. In this work, we propose GANSpeech, which is a high-fidelity multi-speaker T…

Cited by 2 publications (2 citation statements)
References 25 publications
“…The discriminator structure is modeled and represented by D ϕ (x t−1 , x t , t, s) with learnable parameters ϕ. The discriminator uses joint conditional and unconditional loss (JCU) [26], which combines conditional and unconditional adversarial losses to further improve the accuracy of the mel-spectrogram and speech waveform mapping.…”
Section: Diffusion Decoder and Discriminator
confidence: 99%
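The joint conditional and unconditional (JCU) loss quoted above combines two adversarial terms: one score produced by the discriminator alone, and one conditioned on auxiliary features. A minimal numpy sketch of the least-squares (LSGAN-style) form of this objective is below; the function names and the choice of least-squares loss are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def jcu_discriminator_loss(d_uncond_real, d_uncond_fake,
                           d_cond_real, d_cond_fake):
    """Least-squares JCU discriminator objective (illustrative sketch).

    The discriminator emits both an unconditional score and a score
    conditioned on auxiliary features; real scores are pushed toward 1
    and fake scores toward 0, summing the two branches.
    """
    loss_real = (np.mean((d_uncond_real - 1.0) ** 2)
                 + np.mean((d_cond_real - 1.0) ** 2))
    loss_fake = (np.mean(d_uncond_fake ** 2)
                 + np.mean(d_cond_fake ** 2))
    return loss_real + loss_fake

def jcu_generator_loss(d_uncond_fake, d_cond_fake):
    """Generator side: push both fake scores toward 1."""
    return (np.mean((d_uncond_fake - 1.0) ** 2)
            + np.mean((d_cond_fake - 1.0) ** 2))

# A discriminator that scores real samples as 1 and fakes as 0
# achieves zero loss on both branches.
d_loss = jcu_discriminator_loss(np.ones(4), np.zeros(4),
                                np.ones(4), np.zeros(4))
```

Summing the conditional and unconditional branches lets the conditional term sharpen the mel-spectrogram mapping while the unconditional term stabilizes overall sample realism.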
“…In addition, speech synthesis has adopted other generative models that have also achieved very good performance. Flow-based models are found in [15, 16, 17], variational autoencoder (VAE)-based models are listed in [17, 18], generative adversarial network (GAN)-based models are presented in [19], and diffusion process-based models are described in [20, 21, 22, 23].…”
Section: Introduction
confidence: 99%