Reducing over-smoothness in speech synthesis using Generative Adversarial Networks

Sheng, Leyuan; Pavlovskiy, Evgeniy

doi:10.1109/sibircon48586.2019.8957862

Cited by 4 publications

(4 citation statements)

References 17 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In addition to base Tacotron loss, we use guided attention loss for faster attention convergence. we also use the Structural Similarity Index (SSIM) loss [23] to increase the stability of the training and make Mel-spectrograms less blurry. Highquality vocoder can make the audio quality difference caused by spectral blurring more obvious.…”

Section: The Loss For Acoustic Modelingmentioning

confidence: 99%

The HITSZ TTS system for Blizzard challenge 2020

Fu,

Zhang,

Liu

et al. 2020

Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020

View full text Add to dashboard Cite

In this paper, we present the techniques that were used in HITSZ-TTS 1 entry in Blizzard Challenge 2020. The corpus released to the participants this year is about 10-hours speech recordings from a Chinese male speaker with mixed Mandarin and English speech. Based on the above situation, we build an end to end speech synthesis system for this task. It is divided into the following parts: (1) the front-end module to analyze the pronunciation and prosody of text; (2) The phonemeconverted tool; (3) The forward-attention based sequence-tosequence acoustic model with jointly learning with prosody labels to predict 80-dimensional Mel-spectrogram; (4) The Parallel WaveGAN based neural vocoder to reconstruct waveforms. This is the first time for us to join the Blizzard Challenge, and the identifier for our system is G. The evaluation results of subjective listening tests show that the proposed system achieves unsatisfactory performance. The problems in the system are also discussed in this paper.

show abstract

Section: The Loss For Acoustic Modelingmentioning

confidence: 99%

The HITSZ TTS system for Blizzard challenge 2020

Fu,

Zhang,

Liu

et al. 2020

Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020

View full text Add to dashboard Cite

show abstract

“…1) Big gap in naturalness between generated speech and realistic speech: the existing method in unconstrained lip-to-speech adopts the MSE criterion in predicting each spectrogram frame. Such design can not capture the correlation among frequency bins in a frame, which leads to over-smoothness in spectrogram (Sheng and Pavlovskiy 2019). 2) High inference latency: the existing method utilizes the autoregressive architecture, generating current frames conditioned on previous ones.…”

Section: Introductionmentioning

confidence: 99%

Flow-Based Unconstrained Lip to Speech Generation

Zhao

Ren

et al. 2022

AAAI

View full text Add to dashboard Cite

Unconstrained lip-to-speech aims to generate corresponding speeches based on silent facial videos with no restriction to head pose or vocabulary. It is desirable to generate intelligible and natural speech with a fast speed in unconstrained settings. Currently, to handle the more complicated scenarios, most existing methods adopt the autoregressive architecture, which is optimized with the MSE loss. Although these methods have achieved promising performance, they are prone to bring issues including high inference latency and mel-spectrogram over-smoothness. To tackle these problems, we propose a novel flow-based non-autoregressive lip-to-speech model (GlowLTS) to break autoregressive constraints and achieve faster inference. Concretely, we adopt a flow-based decoder which is optimized by maximizing the likelihood of the training data and is capable of more natural and fast speech generation. Moreover, we devise a condition module to improve the intelligibility of generated speech. We demonstrate the superiority of our proposed method through objective and subjective evaluation on Lip2Wav-Chemistry-Lectures and Lip2Wav-Chess-Analysis datasets. Our demo video can be found at https://glowlts.github.io/.

show abstract

“…Early non-autoregressive TTS models (Ren et al, 2019;Peng et al, 2020) use mean absolute error (MAE) or mean square error (MSE) as loss function to model speech mel-spectrograms, implicitly assuming that data points in mel-spectrograms are independent to each other and follow a unimodal distribution 2 . Consequently, the melspectrograms following dependent and multimodal distributions cannot be well modeled by the MAE or MSE loss, which presents great challenges in non-autoregressive TTS modeling and causes over-smoothed (blurred) predictions in melspectrograms (Vasquez and Lewis, 2019;Sheng and Pavlovskiy, 2019).…”

Section: Introductionmentioning

confidence: 99%

Revisiting Over-Smoothness in Text to Speech

Ren¹,

Xu²,

Qin³

et al. 2022

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

View full text Add to dashboard Cite

Non-autoregressive text to speech (NAR-TTS) models have attracted much attention from both academia and industry due to their fast generation speed. One limitation of NAR-TTS models is that they ignore the correlation in time and frequency domains while generating speech mel-spectrograms, and thus cause blurry and over-smoothed results. In this work, we revisit this over-smoothing problem from a novel perspective: the degree of over-smoothness is determined by the gap between the complexity of data distributions and the capability of modeling methods. Both simplifying data distributions and improving modeling methods can alleviate the problem. Accordingly, we first study methods reducing the complexity of data distributions. Then we conduct a comprehensive study on NAR-TTS models that use some advanced modeling methods. Based on these studies, we find that 1) methods that provide additional condition inputs reduce the complexity of data distributions to model, thus alleviating the over-smoothing problem and achieving better voice quality. 2) Among advanced modeling methods, Laplacian mixture loss performs well at modeling multimodal distributions and enjoys its simplicity, while GAN and Glow achieve the best voice quality while suffering from increased training or model complexity.3) The two categories of methods can be combined to further alleviate the over-smoothness and improve the voice quality. 4) Our experiments on the multi-speaker dataset lead to similar conclusions as above and providing more variance information can reduce the difficulty of modeling the target data distribution and alleviate the requirements for model capacity.

show abstract

Reducing over-smoothness in speech synthesis using Generative Adversarial Networks

Cited by 4 publications

References 17 publications

The HITSZ TTS system for Blizzard challenge 2020

The HITSZ TTS system for Blizzard challenge 2020

Flow-Based Unconstrained Lip to Speech Generation

Revisiting Over-Smoothness in Text to Speech

Contact Info

Product

Resources

About