HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

Chen, Jiawei; Tan, Xu; Luan, Jian; Qin, Tao; Liu, Tie-Yan

doi:10.48550/arxiv.2009.01776

Cited by 28 publications (53 citation statements)

References 30 publications (45 reference statements)

“…Therefore, in addition to L1 loss, an adversarial training method is used during the training of CpopSing. This adversarial training method is similar to the sub-frequency adversarial loss in HifiSinger [23] but with an extra multi-length adversarial loss on the spectrogram.…”

Section: Methods (mentioning)

confidence: 99%

“…Generally, with a well-trained neural acoustic model [2,5,6,7] and a neural vocoder [8,9,10,11], or alternatively using fully end-to-end models [12,13,14] which directly construct wave signals from text input, it is able to synthesize high-quality neutral speech. Recently, much attention has been attracted to synthesizing expressive speech, such as stylized speech [15,16], emotional speech [17,18,19,20,21,22], and also singing voice [23,24].…”

Section: Introduction (mentioning)

confidence: 99%

Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis

Wang, Wang, Zhu, et al. 2022

Preprint

This paper introduces Opencpop, a publicly available high-quality Mandarin singing corpus designed for singing voice synthesis (SVS). The corpus consists of 100 popular Mandarin songs performed by a female professional singer. Audio files are recorded with studio quality at a sampling rate of 44,100 Hz and the corresponding lyrics and musical scores are provided. All singing recordings have been phonetically annotated with phoneme boundaries and syllable (note) boundaries. To demonstrate the reliability of the released data and to provide a baseline for future research, we built baseline deep neural network-based SVS models and evaluated them with both objective metrics and subjective mean opinion score (MOS) measure. Experimental results show that the best SVS model trained on our database achieves 3.70 MOS, indicating the reliability of the provided corpus. Opencpop is released to the open-source community WeNet, and the corpus, as well as synthesized demos, can be found on the project homepage.

“…Choi et al. [6] build a Korean singing voice synthesis system using an autoregressive algorithm that generates spectrogram with the boundary equilibrium GAN objective. Chen et al. [2] introduce multi-scale adversarial training in both the acoustic model and vocoder to improve singing modeling. As the papers say, these previous SVS systems could generate natural singing voices.…”

Section: Singing Voice Synthesis (mentioning)

confidence: 99%

“…Singing voice synthesis (SVS) aims to synthesize high-quality and expressive singing voices based on musical score information. Singing voice synthesis (SVS) systems [2,14,22] take music score and lyric information as input to generate singing voices, and these systems have been widely deployed in music softwares, music boxes, and so on. SVS systems could generate singing voices with comparable quality to reference songs, which attract widespread research interest.…”

Section: Introduction (mentioning)

confidence: 99%

“…Following the essential components similar to TTS systems, SVS systems generally adopt an acoustic model [5,23] to convert the musical scores into acoustic features, and a vocoder [2,21] to generate audio waveform from acoustic features. Neural vocoders can synthesize natural-sounding speech, which generally determines the upper bound of generated sound quality.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus

Huang, Chen, Ren, et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia

High-fidelity multi-singer singing voice synthesis is challenging for neural vocoder due to the singing voice data shortage, limited singer generalization, and large computational cost. Existing open corpora could not meet requirements for high-fidelity singing voice synthesis because of the scale and quality weaknesses. Previous vocoders have difficulty in multi-singer modeling, and a distinct degradation emerges when conducting unseen singer singing voice generation. To accelerate singing voice researches in the community, we release a large-scale, multi-singer Chinese singing voice dataset OpenSinger. To tackle the difficulty in unseen singer modeling, we propose Multi-Singer, a fast multi-singer vocoder with generative adversarial networks. Specifically, 1) Multi-Singer uses a multi-band generator to speed up both training and inference procedure. 2) To capture and rebuild singer identity from the acoustic feature (i.e., mel-spectrogram), Multi-Singer adopts a singer conditional discriminator and conditional adversarial training objective. 3) To supervise the reconstruction of singer identity in the spectrum envelopes in frequency domain, we propose an auxiliary singer perceptual loss. The joint training approach effectively works in GANs for multi-singer voices modeling. Experimental results verify the effectiveness of OpenSinger and show that Multi-Singer improves unseen singer singing voices modeling in both speed and quality over previous methods. The further experiment proves that combined with FastSpeech 2 as the acoustic model, Multi-Singer achieves strong robustness in the multi-singer singing voice synthesis pipeline. Samples are available at https://Multi-Singer.github.io/

CCS Concepts: Applied computing → Sound and music computing; Computing methodologies → Natural language generation.

Full-Band LPCNet: A Real-Time Neural Vocoder for 48 kHz Audio With a CPU

et al. 2021

This paper investigates a real-time neural speech synthesis system on CPUs that can synthesize high-fidelity 48 kHz speech waveforms to cover the entire frequency range audible by human beings. Although most previous studies on 48 kHz speech synthesis have used traditional source-filter vocoders or a WaveNet vocoder for waveform generation, they have some drawbacks regarding synthesis quality or inference speed. LPCNet was proposed as a real-time neural vocoder with a mobile CPU but its sampling frequency is still only 16 kHz. In this paper, we propose a Full-band LPCNet to synthesize high-fidelity 48 kHz speech waveforms with a CPU by introducing some simple but effective modifications to the conventional LPCNet. We then evaluate the synthesis quality using both normal speech and a singing voice. The results of these experiments demonstrate that the proposed Full-band LPCNet is the only neural vocoder that can synthesize high-quality 48 kHz speech waveforms while maintaining real-time capability with a CPU.
