Interspeech 2020
DOI: 10.21437/interspeech.2020-1410

XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System
Abstract: This paper presents XiaoiceSing, a high-quality singing voice synthesis system which employs an integrated network for spectrum, F0, and duration modeling. We follow the main architecture of FastSpeech while proposing some singing-specific designs: 1) Besides phoneme ID and position encoding, features from the musical score (e.g. note pitch and length) are also added. 2) To attenuate off-key issues, we add a residual connection in F0 prediction. 3) In addition to the duration loss of each phoneme, the duration of all…
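The residual connection in F0 prediction (point 2 of the abstract) can be illustrated with a minimal sketch: the network predicts only a small deviation, which is added back to the note pitch from the score, so the output stays anchored to the correct key. The function and variable names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def predict_f0(note_pitch, residual):
    """Sketch of residual F0 prediction: the decoder outputs a small
    deviation that is added to the score's note pitch, rather than
    predicting the absolute pitch from scratch."""
    return note_pitch + residual

note = np.array([60.0, 60.0, 62.0, 64.0])   # note pitch from the score (MIDI semitones)
residual = np.array([0.3, -0.2, 0.1, 0.0])  # small learned deviations per frame
f0 = predict_f0(note, residual)             # stays close to the intended key
```

Because the residual is typically small, even an imperfect prediction cannot drift far from the score pitch, which is why this design attenuates off-key artifacts.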

Cited by 45 publications (47 citation statements) | References 14 publications
“…In singing synthesis, several works aim to reduce the burden of dataset annotation. In particular, sequence-to-sequence models generally avoid the need for detailed phonetic segmentation, but do require a fairly well aligned musical score with lyrics [2,3,4,5,6,7,8]. Similarly, voice cloning techniques require only a small amount of training data with phonetic segmentation for the target voice (e.g.…”
Section: Relation To Prior Work
confidence: 99%
“…Singing synthesis has recently seen a notable uptick in research activity, and, inspired by modern deep learning techniques developed for text-to-speech (TTS), great strides have been made, e.g. [1,2,3,4,5,6,7,8]. To create a new voice for these models, a supervised approach is generally used, meaning that besides recordings of the target singer, phonetic segmentation or a reasonably well-aligned score with lyrics is needed.…”
Section: Introduction
confidence: 99%
“…As sequence-to-sequence (Seq2Seq) models have become the predominant architectures in neural TTS, state-of-the-art SVS systems have also adopted encoder-decoder methods and shown improved performance over simple network structures (e.g., DNN, CNN, RNN) [17][18][19][20][21][22][23]. In these methods, the encoders and decoders range from bi-directional Long Short-Term Memory (LSTM) units to multi-head self-attention (MHSA) based blocks.…”
Section: Introduction
confidence: 99%
“…WGANSing [11] introduced an adversarial singing synthesis approach based on a U-Net architecture, optimizing the network with the Wasserstein-GAN (WGAN) loss function [12]. XiaoiceSing [13] adopted the architectural design of FastSpeech [14], which stacks Transformer self-attention blocks with 1D convolutional networks. Improving further on FastSpeech, FastSpeech2 [15] introduced a Variance Adaptor, which predicts duration, pitch, and energy to ease the one-to-many mapping problem.…”
Section: Introduction
confidence: 99%
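A key mechanism shared by FastSpeech-style systems such as XiaoiceSing is the length regulator, which uses the predicted phoneme durations to expand phoneme-level hidden states to frame level. The sketch below is a toy assumption of how that expansion works, not the published implementation.

```python
import numpy as np

def length_regulate(phoneme_hidden, durations):
    """Illustrative length regulator in the spirit of FastSpeech:
    each phoneme-level hidden vector is repeated for its predicted
    number of frames, yielding a frame-level sequence for the decoder."""
    return np.repeat(phoneme_hidden, durations, axis=0)

h = np.array([[0.1], [0.2], [0.3]])  # 3 phoneme embeddings (toy, 1-dim)
d = np.array([2, 1, 3])              # predicted frame count per phoneme
frames = length_regulate(h, d)       # frame-level sequence of length 6
```

Because the frame count per phoneme comes from an explicit duration predictor, this avoids the attention-alignment failures of autoregressive Seq2Seq models; in singing, the score's note lengths additionally constrain those durations.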