2020
DOI: 10.48550/arxiv.2009.01776
Preprint

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

Abstract: High-fidelity singing voices usually require higher sampling rate (e.g., 48kHz, compared with 16kHz or 24kHz in speaking voices) with large range of frequency to convey expression and emotion. However, higher sampling rate causes the wider frequency band and longer waveform sequences and throws challenges for singing modeling in both frequency and time domains in singing voice synthesis (SVS). Conventional SVS systems that adopt moderate sampling rate (e.g., 16kHz or 24kHz) cannot well address the above challe…

Cited by 28 publications (53 citation statements)
References 30 publications (45 reference statements)
“…Therefore, in addition to L1 loss, an adversarial training method is used during the training of CpopSing. This adversarial training method is similar to the sub-frequency adversarial loss in HifiSinger [23] but with an extra multi-length adversarial loss on the spectrogram.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…Generally, with a well-trained neural acoustic model [2,5,6,7] and a neural vocoder [8,9,10,11], or alternatively using fully end-to-end models [12,13,14] which directly construct wave signals from text input, it is able to synthesize high-quality neutral speech. Recently, much attention has been attracted to synthesizing expressive speech, such as stylized speech [15,16], emotional speech [17,18,19,20,21,22], and also singing voice [23,24].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Choi et al. [6] build a Korean singing voice synthesis system using an autoregressive algorithm that generates spectrogram with the boundary equilibrium GAN objective. Chen et al. [2] introduce multi-scale adversarial training in both the acoustic model and vocoder to improve singing modeling. As the papers say, these previous SVS systems could generate natural singing voices.…”
Section: Singing Voice Synthesis
Citation type: mentioning
confidence: 99%
“…Singing voice synthesis (SVS) aims to synthesize high-quality and expressive singing voices based on musical score information. Singing voice synthesis (SVS) systems [2,14,22] take music score and lyric information as input to generate singing voices, and these systems have been widely deployed in music softwares, music boxes, and so on. SVS systems could generate singing voices with comparable quality to reference songs, which attract widespread research interest.…”
Section: Introduction
Citation type: mentioning
confidence: 99%