High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-Plus-Noise Model

Hwang, Min-Jae; Yamamoto, Ryōichi; Song, Eun Seop; Jaemin, Kim

doi:10.21437/interspeech.2021-976

Cited by 6 publications

(5 citation statements)

References 0 publications


“…To address these issues, GAN-based vocoders have been widely explored to take advantage of the compact generator size because the discriminator greatly helps the compact generator achieve high-fidelity speech generation. Parallel WaveGAN (PWG) [35] and MelGAN [36] are the recent most popular GAN-based vocoders, and many subsequent GAN-based vocoders are based on them [11], [12], [13], [14], [25], [26], [27], [37], [38], [39], [40], [41]. Non-autoregressive models without GAN are also proposed.…”

Section: B Neural Vocoders Based On Generative Models

mentioning

confidence: 99%

“…To further improve the F0 controllability, we introduce F0-driven mechanisms designed on the basis of QP-PWG and NSF into the source network. Moreover, inspired by the recent successes of the neural vocoders that adopt harmonic-plus-noise (HN) speech modeling [13], [14], [21], [22], we introduce HN source excitation generation to obtain better sound quality. The overall architecture of uSFGAN is shown in Fig.…”

Section: Unified Source-Filter GAN

mentioning

confidence: 99%

“…To improve the source excitation signal modeling, especially for the unvoiced parts, we introduce a harmonic-plus-noise excitation generation mechanism inspired by the current successful works [13], [14], [21], [22] based on [50]. To explicitly model the periodic and aperiodic components, previous works [13], [14], [21], [22], [50] prepared two networks for generating each component and devised the architecture and input features for each. We adopt two harmonic-plus-noise modeling schemes, the cascade and parallel model structures, referring to Period-Net [13].…”

Section: B F0-Driven Source Excitation Generation

mentioning

confidence: 99%

“…Moreover, the periodicity estimation is crucial for the naturalness of generated speech. Regarding NHV [22] and HN parallel waveGAN (HN-PWG) [14], we prepare a network to estimate periodicity-related weights from acoustic features and mix periodic and aperiodic source components on the basis of the weights.…”

Section: B F0-Driven Source Excitation Generation

mentioning

confidence: 99%
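The mixing step described in the statement above can be illustrated with a minimal sketch. It is not the implementation from HN-PWG, NHV, or uSFGAN: in those works the periodicity-related weights are estimated by a network from acoustic features, whereas here they are simply passed in as a list. The function name `harmonic_plus_noise_excitation`, the sine/Gaussian source choice, and the default sample rate are all assumptions made for illustration.

```python
import math
import random


def harmonic_plus_noise_excitation(f0, periodicity, sample_rate=24000, seed=0):
    """Mix a periodic and an aperiodic source signal sample by sample.

    f0          -- per-sample fundamental frequency in Hz (0 marks unvoiced samples)
    periodicity -- per-sample weights in [0, 1]; 1 means fully periodic
    Both inputs are hypothetical stand-ins for what a network would predict.
    """
    if len(f0) != len(periodicity):
        raise ValueError("f0 and periodicity must have the same length")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    phase = 0.0
    excitation = []
    for hz, weight in zip(f0, periodicity):
        # Periodic component: a sine whose phase is the running integral of F0.
        phase += 2.0 * math.pi * hz / sample_rate
        periodic = math.sin(phase) if hz > 0 else 0.0
        # Aperiodic component: unit-variance Gaussian noise.
        aperiodic = rng.gauss(0.0, 1.0)
        # Weighted mix of the two components.
        excitation.append(weight * periodic + (1.0 - weight) * aperiodic)
    return excitation
```

With all weights set to 1 the output is a pure sine at the given F0; with all weights set to 0 it is pure noise. Intermediate weights interpolate between the two.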

“…Specifically, conventional vocoders [1], [2] based on source-filter models [3], [4], [5] can flexibly control speech characteristics, but the quality of the generated speech is low because of their over-simplified speech production process. Recent high-fidelity neural vocoders [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] lack the robustness to unseen data because of their purely data-driven training-manners. For example, the state-of-the-art neural vocoder, HiFi-GAN [12], fails to generate high-fidelity speech when the input features include F0 values deviating from the F0 range of the training data.…”

mentioning

confidence: 99%


High-Fidelity and Pitch-Controllable Neural Vocoder Based on Unified Source-Filter Networks

Yoneyama, Wu, Toda

2023

IEEE/ACM Trans. Audio Speech Lang. Process.


We introduce unified source-filter generative adversarial networks (uSFGAN), a waveform generative model conditioned on acoustic features, which represents the source-filter architecture in a generator network. Unlike the previous neural-based source-filter models in which parametric signal process modules are combined with neural networks, our approach enables unified optimization of both the source excitation generation and resonance filtering parts to achieve higher sound quality. In the uSFGAN framework, several specific regularization losses are proposed to enable the source excitation generation part to output reasonable source excitation signals. Both objective and subjective experiments are conducted, and the results demonstrate that the proposed uSFGAN achieves comparable sound quality to HiFi-GAN in the speech reconstruction task and outperforms WORLD in the F0 transformation task. Moreover, we argue that the F0-driven mechanism and the inductive bias obtained by source-filter modeling improve the robustness against unseen F0 in training as shown by the results of experimental evaluations. Audio samples are available at our demo site at https://chomeyama.github.io/PitchControllableNeuralVocoder-Demo/.
