Litesing: Towards Fast, Lightweight and Expressive Singing Voice Synthesis

Zhuang, Xiaobin; Jiang, Tao; Chou, Szu-Yu; Wu, Bin; Peng, Hu; Lui, Simon

doi:10.1109/icassp39728.2021.9414043

Cited by 11 publications

(11 citation statements)

References 9 publications

“…To qualitatively examine the controllability of the proposed methods, we tried various style modifications by manipulating the initial LST sequence and f0 contour.…”

Section: Qualitative Analysis (mentioning)

confidence: 99%

“…Recently, interest in research on the SVS system that can reflect musical expression is increasing. A method of explicitly modeling information such as pitch curves, energy, and V/UV, which can be extracted directly from the vocal signal, was proposed in [2]. [3] proposed a method to interpret the music score more naturally by introducing a module that predicts the difference between the actual singing and the score.…”

Section: Related Work (mentioning)

confidence: 99%

“…Singing voice synthesis (SVS) is the task of generating a natural singing voice from a given musical score. With the development of various deep generative models, research on synthesizing high-quality singing voice has been emerging recently [1,2,3,4]. As the performance of the SVS improves, there are increasing cases in which the technology is applied to the production of actual music content [5].…”

Section: Introduction (mentioning)

confidence: 99%

Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder

Lee, Choi, Lee

2022

Preprint

This paper proposes a controllable singing voice synthesis system capable of generating expressive singing voice with two novel methodologies. First, a local style token module, which predicts frame-level style tokens from an input pitch and text sequence, is proposed to allow the singing voice system to control musical expression often unspecified in sheet music (e.g., breathing and intensity). Second, we propose a dual-path pitch encoder with a choice of two different pitch inputs: MIDI pitch sequence or f0 contour. Because the initial generation of a singing voice is usually executed by taking a MIDI pitch sequence, one can later extract an f0 contour from the generated singing voice and modify the f0 contour to a finer level as desired. Through quantitative and qualitative evaluations, we confirmed that the proposed model could control various musical expressions while not sacrificing the sound quality of the singing voice synthesis system.

“…Singing voice synthesis (SVS) systems [1]-[7] generate singing voices from musical scores which contain music information such as lyrics, tempo, pitch, etc. SVS is similar to the text-to-speech (TTS) task [8]-[13] in terms of generating speech from text.…”

Section: Introduction (mentioning)

confidence: 99%

“…For example, to predict pitch features better, Yi et al. [18] utilized a deep autoregressive network to capture the dependencies among the consecutive acoustic features. Zhuang et al. [1] separated the pitch feature from the acoustic feature to avoid the interdependence between these pitch features and the timbre features. Ren et al. [9] introduced the pitch and energy information into the speech generation task and presented variance adaptors to make the generated audio expressive.…”

Section: Introduction (mentioning)

confidence: 99%

Singing Voice Synthesis with Vibrato Modeling and Latent Energy Representation

Song, Zhang, et al. 2022

2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP)

This paper proposes an expressive singing voice synthesis system by introducing explicit vibrato modeling and latent energy representation. Vibrato is essential to the naturalness of synthesized sound, due to the inherent characteristics of human singing. Hence, a deep learning-based vibrato model is introduced in this paper to control the vibrato's likeliness, rate, depth and phase in singing, where the vibrato likeliness represents the probability that vibrato is present and helps improve the naturalness of the singing voice. Because existing singing corpora contain no annotated labels for vibrato likeliness, we adopt a novel labeling method that labels the vibrato likeliness automatically. Meanwhile, the power spectrogram of audio contains rich information that can improve the expressiveness of singing. An autoencoder-based latent energy bottleneck feature is proposed for expressive singing voice synthesis. Experimental results on the open dataset NUS48E show that both the vibrato modeling and the latent energy representation could significantly improve the expressiveness of the singing voice. The audio samples are available on the demo website.
