UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

Jang, Won; Lim, Dan; Yoon, Jaesam; Kim, Bong-Wan; Kim, Jun Tae

doi:10.48550/arxiv.2106.07889

Cited by 3 publications

(10 citation statements)

References 24 publications


“…When trained with audio with a conventional sample rate (22050Hz or 24000Hz), evaluations had shown better audio quality and improved voice naturalness dimensions even when compared with flow-based models or autoregressive methods. Also, UnivNet [24] presented a new discriminator design based on the idea of deciding on a linear spectrogram calculated using STFT rather than on the waveform. This improvement dramatically reduces the discriminator's difficulty differentiating generated signals from ground truths and focuses more on the higher frequency component of the audio signal, which heavily affects the voice quality.…”

Section: Related Work (mentioning)

confidence: 99%
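As a rough illustration of what the statement above describes, the sketch below computes linear magnitude spectrograms of one waveform at several STFT resolutions, which is the kind of input a spectrogram-based discriminator would see instead of raw samples. The `RESOLUTIONS` triples, the Hann window, and the function names are assumptions made for this example; they are not given in the quoted text, and no discriminator network is included.

```python
import numpy as np

# Hypothetical (n_fft, hop_length, win_length) triples, for illustration only.
RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]


def linear_spectrogram(wave, n_fft, hop_length, win_length):
    """Return the STFT magnitude of a 1-D signal, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(wave) - win_length) // hop_length
    frames = np.stack([
        wave[i * hop_length: i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    # rfft zero-pads each windowed frame from win_length up to n_fft.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))


def multi_resolution_spectrograms(wave):
    """Compute one linear spectrogram per entry in RESOLUTIONS."""
    return [linear_spectrogram(wave, *res) for res in RESOLUTIONS]


if __name__ == "__main__":
    sample_rate = 24000
    t = np.arange(sample_rate) / sample_rate
    wave = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz tone
    for res, spec in zip(RESOLUTIONS, multi_resolution_spectrograms(wave)):
        print(res, spec.shape)
```

Each resolution trades time detail against frequency detail, so a separate discriminator per spectrogram can inspect the same signal at different granularities.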

“…However, we use period parameters of 2, 3, 5, 7, 11 in the hope of being better suited to 44100 Hz full-band generation and accelerating the training process by reducing the computational need for the discriminator. The other is the exact Multi-Resolution Discriminator we adopt from UnivNet [24], with the same three sets of parameters as the authors mentioned in the paper. FFT sizes and hop sizes in the three sets are perfectly compatible with the parameter sets we use for the Multi-param Mel Loss and thus avoid potential conflicts between them.…”

Section: Discriminator (mentioning)

confidence: 99%
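The period parameters in the statement above refer to a period-based discriminator, which typically folds the 1-D waveform into a 2-D array so that samples one period apart line up in the same column. The sketch below shows only that folding step for the quoted periods. The function name and the choice of reflect padding are assumptions for this example, not details confirmed by the quoted text.

```python
import numpy as np

PERIODS = (2, 3, 5, 7, 11)  # period values taken from the quoted statement


def fold_by_period(wave, period):
    """Pad a 1-D waveform to a multiple of `period` and reshape it to (rows, period).

    Reflect padding is an assumption made for this sketch.
    """
    remainder = len(wave) % period
    if remainder:
        wave = np.pad(wave, (0, period - remainder), mode="reflect")
    # After the reshape, column j holds samples j, j + period, j + 2 * period, ...
    return wave.reshape(-1, period)


if __name__ == "__main__":
    wave = np.arange(10.0)
    for p in PERIODS:
        print(p, fold_by_period(wave, p).shape)
```

Because the periods are prime, the folded views overlap as little as possible, so each one exposes a different periodic structure of the signal.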


RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses

Xu¹, Zhao², Guo³

2021

Preprint


Most GAN (Generative Adversarial Network)-based approaches to high-fidelity waveform generation rely heavily on discriminators to improve their performance. However, the overuse of this GAN method introduces much uncertainty into the generation process and often results in mismatches of pitch and intensity, which is fatal in sensitive use cases such as singing voice synthesis (SVS). To address this problem, we propose RefineGAN, a high-fidelity neural vocoder with faster-than-real-time generation capability, focused on robustness, pitch and intensity accuracy, and full-band audio generation. We employ a pitch-guided refine architecture with a multi-scale spectrogram-based loss function to help stabilize the training process and maintain the robustness of the neural vocoder while using the GAN-based training method. Audio generated using this method performs better in subjective tests than the ground-truth audio. This result shows that fidelity is even improved during waveform reconstruction by eliminating defects produced by the speaker and the recording procedure. Moreover, a further study shows that models trained on a specified type of data perform equally well on a totally unseen language and unseen speaker. Generated sample pairs are provided on https://timedomain-tech.github.io/refinegan/ .


“…The spoofing attacks are mainly categorized into two types: physical access (PA) and logical access (LA). The PA considers the replay attack [2][3][4] while the LA includes the spoofing attacks based on text-to-speech synthesis [5][6][7][8] and voice conversion [9][10][11] technologies.…”

Section: Introduction (mentioning)

confidence: 99%

Phase-Aware Spoof Speech Detection Based on Res2Net with Phase Network

Kim¹, Ban²

2022

Preprint

Self Cite


Spoof speech detection (SSD) is an essential countermeasure for automatic speaker verification systems. Although SSD with magnitude features in the frequency domain has shown promising results, phase information can also be important for capturing the artefacts of certain types of spoofing attacks. Thus, both magnitude and phase features must be considered to ensure generalization to diverse types of spoofing attacks. In this paper, we investigate why feature-level fusion failed in previous works through an entropy analysis, from which we found that the randomness difference between magnitude and phase features is large, which can interrupt feature-level fusion via the backend neural network; thus, we propose a phase network to reduce that difference. Our SSD system, a Res2Net equipped with the phase network, achieved significant performance improvement, specifically for spoofing attacks in which phase information is considered to be important. We also demonstrate our SSD system in both known- and unknown-kind SSD scenarios for practical applications.


“…When the generator reaches a Nash equilibrium, it is expected to synthesize a high-quality waveform. GAN-based methods ([2], [1], [21], [22], [23], [24], [3], [4], [25], etc.) are promising, as some models are even capable of synthesizing waveforms in real time on a single GPU or even a CPU while achieving a comparable MOS, which makes them very suitable for actual industrial use.…”

Section: Introduction (mentioning)

confidence: 99%

“…Recently, HiFiGAN [3] proposed a novel multi-period discriminator and achieved the state of the art in waveform quality and real-time speed on CPU. UnivNet [4] and Universal MelGAN [26] also propose multi-resolution spectrogram discriminators using a 2D convolution-based discriminator in the frequency domain to eliminate high-frequency artifacts, such as metallic noise and reverberation in the auditory domain. StyleMelGAN [25] synthesizes high-quality waveforms by using the Adaptive Batch Normalization block conditioned on the Mel spectrogram.…”

Section: Introduction (mentioning)

confidence: 99%

WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

Wang¹, Yi²

2022

Preprint


Recently, GAN-based neural vocoders such as Parallel WaveGAN[1], MelGAN[2], HiFiGAN[3], and UnivNet[4] have become popular due to their lightweight and parallel structure, resulting in a real-time synthesized waveform with high fidelity, even on a CPU. HiFiGAN[3] and UnivNet[4] are two SOTA vocoders. Despite their high quality, there is still room for improvement. In this paper, motivated by the structure of Vision Outlooker from computer vision, we adopt a similar idea and propose an effective and lightweight neural vocoder called WOLONet. In this network, we develop a novel lightweight block that uses a location-variable, channel-independent, and depthwise dynamic convolutional kernel with sinusoidally activated dynamic kernel weights. To demonstrate the effectiveness and generalizability of our method, we perform an ablation study to verify our novel design and make a subjective and objective comparison with typical GAN-based vocoders. The results show that our WOLONet achieves the best generation quality while requiring fewer parameters than the two neural SOTA vocoders, i.e., HiFiGAN and UnivNet.
