Interspeech 2020
DOI: 10.21437/interspeech.2020-1410

XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System
Abstract: This paper presents XiaoiceSing, a high-quality singing voice synthesis system which employs an integrated network for spectrum, F0, and duration modeling. We follow the main architecture of FastSpeech while proposing some singing-specific designs: 1) Besides phoneme ID and position encoding, features from the musical score (e.g. note pitch and length) are also added. 2) To attenuate off-key issues, we add a residual connection in F0 prediction. 3) In addition to the duration loss of each phoneme, the duration of all…
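The residual connection in F0 prediction (point 2 of the abstract) can be illustrated with a minimal sketch: the network predicts only a small deviation, which is added back to the note pitch from the score, so the output stays anchored to the correct key. The function and variable names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def predict_f0(note_pitch, residual):
    """Sketch of residual F0 prediction: the decoder outputs a small
    deviation that is added to the score's note pitch, rather than
    predicting the absolute pitch from scratch."""
    return note_pitch + residual

note = np.array([60.0, 60.0, 62.0, 64.0])   # note pitch from the score (MIDI semitones)
residual = np.array([0.3, -0.2, 0.1, 0.0])  # small learned deviations per frame
f0 = predict_f0(note, residual)             # stays close to the intended key
```

Because the residual is typically small, even an imperfect prediction cannot drift far from the score pitch, which is why this design attenuates off-key artifacts.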

Cited by 45 publications (47 citation statements) | References 14 publications
“…In singing synthesis, several works aim to reduce the burden of dataset annotation. In particular, sequence-to-sequence models generally avoid the need for detailed phonetic segmentation, but do require a fairly well aligned musical score with lyrics [2,3,4,5,6,7,8]. Similarly, voice cloning techniques require only a small amount of training data with phonetic segmentation for the target voice (e.g.…”
Section: Relation To Prior Work
confidence: 99%
“…Singing synthesis has recently seen a notable uptick in research activity, and, inspired by modern deep learning techniques developed for text-to-speech (TTS), great strides have been made, e.g. [1,2,3,4,5,6,7,8]. To create a new voice for these models, a supervised approach is generally used, meaning that besides recordings of the target singer, phonetic segmentation or a reasonably well-aligned score with lyrics is needed.…”
Section: Introduction
confidence: 99%
“…As sequence-to-sequence (Seq2Seq) models have become the predominant architectures in neural TTS, state-of-the-art SVS systems have also adopted encoder-decoder methods and shown improved performance over simple network structures (e.g., DNN, CNN, RNN) [17][18][19][20][21][22][23]. In these methods, the encoders and decoders range from bi-directional Long Short-Term Memory (LSTM) units to multi-head self-attention (MHSA) based blocks.…”
Section: Introduction
confidence: 99%
“…WGANSing [11] introduced an adversarial singing synthesis approach based on a U-Net architecture, optimizing the network with the Wasserstein-GAN (WGAN) loss function [12]. XiaoiceSing [13] adopted the architectural design of FastSpeech [14], which stacks Transformer self-attention blocks with 1D convolutional networks. Improving further on FastSpeech, FastSpeech2 [15] introduced a Variance Adaptor, which predicts duration, pitch, and energy to ease the one-to-many mapping problem.…”
Section: Introduction
confidence: 99%
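A key mechanism shared by FastSpeech-style systems such as XiaoiceSing is the length regulator, which uses the predicted phoneme durations to expand phoneme-level hidden states to frame level. The sketch below is a toy assumption of how that expansion works, not the published implementation.

```python
import numpy as np

def length_regulate(phoneme_hidden, durations):
    """Illustrative length regulator in the spirit of FastSpeech:
    each phoneme-level hidden vector is repeated for its predicted
    number of frames, yielding a frame-level sequence for the decoder."""
    return np.repeat(phoneme_hidden, durations, axis=0)

h = np.array([[0.1], [0.2], [0.3]])  # 3 phoneme embeddings (toy, 1-dim)
d = np.array([2, 1, 3])              # predicted frame count per phoneme
frames = length_regulate(h, d)       # frame-level sequence of length 6
```

Because the frame count per phoneme comes from an explicit duration predictor, this avoids the attention-alignment failures of autoregressive Seq2Seq models; in singing, the score's note lengths additionally constrain those durations.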