PeriodNet: A Non-Autoregressive Waveform Generation Model with a Structure Separating Periodic and Aperiodic Components

Hono, Yukiya; Takaki, Shinji; Hashimoto, Kei; Oura, Keiichiro; Nankaku, Yoshihiko; Tokuda, Keiichi

doi:10.1109/icassp39728.2021.9414401

Cited by 9 publications

(7 citation statements)

References 20 publications


“…To address these issues, GAN-based vocoders have been widely explored to take advantage of the compact generator size because the discriminator greatly helps the compact generator achieve high-fidelity speech generation. Parallel WaveGAN (PWG) [35] and MelGAN [36] are the recent most popular GAN-based vocoders, and many subsequent GAN-based vocoders are based on them [11], [12], [13], [14], [25], [26], [27], [37], [38], [39], [40], [41]. Non-autoregressive models without GAN are also proposed.…”

Section: B Neural Vocoders Based On Generative Models

mentioning

confidence: 99%

“…To further improve the F0 controllability, we introduce F0-driven mechanisms designed on the basis of QP-PWG and NSF into the source network. Moreover, inspired by the recent successes of the neural vocoders that adopt harmonic-plus-noise (HN) speech modeling [13], [14], [21], [22], we introduce HN source excitation generation to obtain better sound quality. The overall architecture of uSFGAN is shown in Fig.…”

Section: Unified Source-filter GAN

mentioning

confidence: 99%

“…To improve the source excitation signal modeling, especially for the unvoiced parts, we introduce a harmonic-plus-noise excitation generation mechanism inspired by the current successful works [13], [14], [21], [22] based on [50]. To explicitly model the periodic and aperiodic components, previous works [13], [14], [21], [22], [50] prepared two networks for generating each component and devised the architecture and input features for each. We adopt two harmonic-plus-noise modeling schemes, the cascade and parallel model structures, referring to Period-Net [13].…”

Section: B F0-Driven Source Excitation Generation

mentioning

confidence: 99%
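
The harmonic-plus-noise idea in the statement above can be illustrated with a small signal-level sketch. It is not the PeriodNet or uSFGAN implementation (both use neural generators for each component); it only shows, under stated assumptions, how a periodic excitation driven by F0 and an aperiodic noise excitation can be built separately and then summed. The function name, sample rate, hop size, and amplitude constants are all hypothetical choices made for this example.

```python
import numpy as np


def harmonic_plus_noise_excitation(f0, sample_rate=24000, hop_size=120,
                                   sine_amp=0.1, voiced_noise_std=0.003,
                                   seed=0):
    """Build separate periodic and aperiodic excitation signals from frame-level F0.

    f0: per-frame fundamental frequency in Hz, with 0 marking unvoiced frames.
    All default values are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    f0 = np.asarray(f0, dtype=np.float64)

    # Upsample frame-level F0 to sample level by simple repetition.
    f0_samples = np.repeat(f0, hop_size)
    voiced = f0_samples > 0

    # Periodic component: a sine whose instantaneous phase is the running
    # integral of F0. It is silenced in unvoiced regions.
    phase = 2.0 * np.pi * np.cumsum(f0_samples / sample_rate)
    periodic = np.where(voiced, sine_amp * np.sin(phase), 0.0)

    # Aperiodic component: Gaussian noise, weak in voiced regions and
    # stronger in unvoiced regions.
    noise = rng.normal(0.0, 1.0, size=f0_samples.shape)
    noise_gain = np.where(voiced, voiced_noise_std, sine_amp / 3.0)
    aperiodic = noise_gain * noise

    # Parallel-style combination: the two components are simply summed.
    return periodic, aperiodic, periodic + aperiodic


if __name__ == "__main__":
    frames = [0.0, 0.0, 220.0, 220.0, 0.0]  # two voiced frames at 220 Hz
    p, a, e = harmonic_plus_noise_excitation(frames, hop_size=100)
    print(p.shape, a.shape, e.shape)
```

In a neural vocoder of the kind the statement describes, signals like these would typically serve as inputs to separate generator networks rather than as the final waveform; the simple sum here stands in for the parallel structure only.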

“…Specifically, conventional vocoders [1], [2] based on source-filter models [3], [4], [5] can flexibly control speech characteristics, but the quality of the generated speech is low because of their over-simplified speech production process. Recent high-fidelity neural vocoders [6], [7], [8], [9], [10], [11], [12], [13], [14], [15] lack the robustness to unseen data because of their purely data-driven training-manners. For example, the state-of-the-art neural vocoder, HiFi-GAN [12], fails to generate high-fidelity speech when the input features include F0 values deviating from the F0 range of the training data.…”

mentioning

confidence: 99%


High-Fidelity and Pitch-Controllable Neural Vocoder Based on Unified Source-Filter Networks

Yoneyama, Wu, Toda

2023

IEEE/ACM Trans. Audio Speech Lang. Process.


We introduce unified source-filter generative adversarial networks (uSFGAN), a waveform generative model conditioned on acoustic features, which represents the source-filter architecture in a generator network. Unlike the previous neural-based source-filter models in which parametric signal process modules are combined with neural networks, our approach enables unified optimization of both the source excitation generation and resonance filtering parts to achieve higher sound quality. In the uSFGAN framework, several specific regularization losses are proposed to enable the source excitation generation part to output reasonable source excitation signals. Both objective and subjective experiments are conducted, and the results demonstrate that the proposed uSFGAN achieves comparable sound quality to HiFi-GAN in the speech reconstruction task and outperforms WORLD in the F0 transformation task. Moreover, we argue that the F0-driven mechanism and the inductive bias obtained by source-filter modeling improve the robustness against unseen F0 in training as shown by the results of experimental evaluations. Audio samples are available at our demo site at https://chomeyama.github.io/PitchControllableNeuralVocoder-Demo/.


“…In [95], a non-autoregressive neural vocoder called Period-Net [107] is adopted, which is a non-autoregressive GAN-based neural vocoder that is shown to be more robust for generating accurate pitch. Moreover, an automatic pitch correction technique is incorporated that ensures accurate pitch in the synthesized singing voices.…”

Section: Multi-variate Density Output

mentioning

confidence: 99%

Deep Learning Approaches in Topics of Singing Information Processing

Gupta, Goto

2022

IEEE/ACM Trans. Audio Speech Lang. Process.


Singing, the vocal production of musical tones, is one of the most important elements of music. Addressing the needs of real-world applications, the study of technologies related to singing voices has become an increasingly active area of research. In this paper, we provide a comprehensive overview of the recent developments in the field of singing information processing, specifically in the topics of singing skill evaluation, singing voice synthesis, singing voice separation, and lyrics synchronization and transcription. We will especially focus on deep learning approaches including modern representation learning techniques for singing voices. We will also provide an overview of contributions in public datasets for singing voice research.


Full-Band LPCNet: A Real-Time Neural Vocoder for 48 kHz Audio With a CPU

et al. 2021


This paper investigates a real-time neural speech synthesis system on CPUs that can synthesize high-fidelity 48 kHz speech waveforms to cover the entire frequency range audible by human beings. Although most previous studies on 48 kHz speech synthesis have used traditional source-filter vocoders or a WaveNet vocoder for waveform generation, they have some drawbacks regarding synthesis quality or inference speed. LPCNet was proposed as a real-time neural vocoder with a mobile CPU but its sampling frequency is still only 16 kHz. In this paper, we propose a Full-band LPCNet to synthesize high-fidelity 48 kHz speech waveforms with a CPU by introducing some simple but effective modifications to the conventional LPCNet. We then evaluate the synthesis quality using both normal speech and a singing voice. The results of these experiments demonstrate that the proposed Full-band LPCNet is the only neural vocoder that can synthesize high-quality 48 kHz speech waveforms while maintaining real-time capability with a CPU.

