2022
DOI: 10.48550/arxiv.2201.03864
Preprint

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

Abstract: Multi-speaker singing voice synthesis aims to generate singing voices sung by different speakers. To generalize to new speakers, previous zero-shot singing adaptation methods obtain the timbre of the target speaker as a fixed-size embedding extracted from a single reference audio. However, they face several challenges: 1) a fixed-size speaker embedding is not powerful enough to capture the full details of the target timbre; 2) a single reference audio does not contain sufficient timbre information about the target speaker; 3) …

Cited by 2 publications (4 citation statements)
References 15 publications
“…Multi-singer SVS: In the speech synthesis field, many multi-speaker TTS models reflect speaker identity by feeding a fixed-size speaker embedding into the decoder [34,35]. Most multi-singer SVS models represent singer characteristics in similar ways [31,36,37]. However, learning these characteristics requires sufficient data for each singer.…”
Section: Related Work (mentioning)
confidence: 99%
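For illustration, the fixed-size speaker-embedding scheme described in this statement can be sketched as a lookup table that maps a speaker ID to a single vector, which then conditions the decoder. The class name, PyTorch modules, and dimensions below are assumptions for this sketch, not details of the cited models.

```
import torch
import torch.nn as nn

class SpeakerConditionedDecoder(nn.Module):
    """Minimal sketch: condition a decoder on a fixed-size speaker embedding."""

    def __init__(self, n_speakers: int, hidden_dim: int = 256, mel_dim: int = 80):
        super().__init__()
        # One fixed-size vector per known speaker ID (lookup table).
        self.speaker_table = nn.Embedding(n_speakers, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.to_mel = nn.Linear(hidden_dim, mel_dim)

    def forward(self, encoder_out: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, frames, hidden_dim); speaker_id: (batch,)
        spk = self.speaker_table(speaker_id).unsqueeze(1)   # (batch, 1, hidden_dim)
        conditioned = encoder_out + spk                     # broadcast over all frames
        hidden, _ = self.decoder(conditioned)
        return self.to_mel(hidden)                          # predicted mel-spectrogram

decoder = SpeakerConditionedDecoder(n_speakers=10)
mel = decoder(torch.randn(2, 100, 256), torch.tensor([3, 7]))  # (2, 100, 80)
```

Because the table only holds vectors for speakers seen during training, this scheme cannot generalize to unseen singers without additional data, which is the limitation the quoted statement points at.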
“…[38] learns the characteristics of each singer by adapting a pre-trained SVS model to a target singer. [36,37] present zero-shot style adaptation methods that use reference encoders to extract singer embeddings from the reference audio. In particular, [37] applies multiple reference encoders and multi-head attention to reflect singer characteristics more effectively.…”
Section: Related Work (mentioning)
confidence: 99%
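The multi-reference idea described in this statement can be sketched as follows: each reference audio is summarized into one timbre vector, and multi-head attention lets the phoneme states draw on all of them at once. This is a rough sketch of the general technique under assumed shapes and module choices, not the exact architecture of [37].

```
import torch
import torch.nn as nn

class MultiReferenceTimbreEncoder(nn.Module):
    """Sketch: pool timbre information from several reference utterances with attention."""

    def __init__(self, mel_dim: int = 80, embed_dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Each reference mel-spectrogram is summarized into one timbre vector.
        self.reference_encoder = nn.GRU(mel_dim, embed_dim, batch_first=True)
        # Phoneme (query) states attend over the per-reference timbre vectors.
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, phoneme_states: torch.Tensor, references: torch.Tensor) -> torch.Tensor:
        # phoneme_states: (batch, phones, embed_dim)
        # references:     (batch, n_refs, frames, mel_dim)
        b, n_refs, frames, mel_dim = references.shape
        _, last = self.reference_encoder(references.reshape(b * n_refs, frames, mel_dim))
        ref_embeddings = last[-1].reshape(b, n_refs, -1)         # (batch, n_refs, embed_dim)
        timbre, _ = self.attention(phoneme_states, ref_embeddings, ref_embeddings)
        return phoneme_states + timbre                           # timbre-conditioned states

enc = MultiReferenceTimbreEncoder()
out = enc(torch.randn(2, 50, 256), torch.randn(2, 3, 200, 80))  # (2, 50, 256)
```

The attention weights decide how much each reference contributes per phoneme, which is how multiple reference audios can supply more timbre information than a single fixed-size embedding.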
“…To further distinguish vowels and consonants, a duration predictor is built to produce fine-grained phoneme-level durations, trained with supervision obtained from forced alignment [6][7][8][9][10][11], heuristics [12][13][14][15], etc. The advantage of this type of feature-processing strategy is that the input phoneme and pitch sequences are strictly aligned at the note level based on the music score.…”
Section: Introduction (mentioning)
confidence: 99%
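A duration predictor of the kind described in this statement can be sketched as a small convolutional network that outputs one log-duration per phoneme and is trained against per-phoneme frame counts obtained from a forced aligner. Layer sizes, the loss, and the class name are illustrative assumptions, not the setup of any cited paper.

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class DurationPredictor(nn.Module):
    """Sketch: predict a log-scale duration (in frames) for each phoneme."""

    def __init__(self, hidden_dim: int = 256, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, phoneme_states: torch.Tensor) -> torch.Tensor:
        # phoneme_states: (batch, phones, hidden_dim)
        x = phoneme_states.transpose(1, 2)        # Conv1d expects (batch, channels, phones)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return self.proj(x.transpose(1, 2)).squeeze(-1)  # (batch, phones) log-durations

# Training target: log of per-phoneme frame counts, e.g. produced by a forced aligner.
predictor = DurationPredictor()
states = torch.randn(2, 40, 256)
target_frames = torch.randint(1, 20, (2, 40)).float()
loss = F.mse_loss(predictor(states), torch.log(target_frames))
```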