2022
DOI: 10.48550/arxiv.2201.03864
Preprint

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

Abstract: Multi-speaker singing voice synthesis aims to generate singing voices sung by different speakers. To generalize to new speakers, previous zero-shot singing adaptation methods obtain the timbre of the target speaker as a fixed-size embedding extracted from a single reference audio. However, they face several challenges: 1) a fixed-size speaker embedding is not powerful enough to capture the full details of the target timbre; 2) a single reference audio does not contain sufficient timbre information about the target speaker; 3) …

Cited by 2 publications (4 citation statements)
References 15 publications
“…Multi-singer SVS: In the speech synthesis field, many multi-speaker TTS models reflect speaker identity by feeding a fixed-size speaker embedding into the decoder [34,35]. Most multi-singer SVS models represent singer characteristics in similar ways [31,36,37]. However, learning these characteristics requires sufficient data for each singer.…”
Section: Related Work (mentioning)
confidence: 99%
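For illustration, the fixed-size speaker-embedding scheme described in this statement can be sketched as a lookup table that maps a speaker ID to a single vector, which then conditions the decoder. The class name, PyTorch modules, and dimensions below are assumptions for this sketch, not details of the cited models.

```
import torch
import torch.nn as nn

class SpeakerConditionedDecoder(nn.Module):
    """Minimal sketch: condition a decoder on a fixed-size speaker embedding."""

    def __init__(self, n_speakers: int, hidden_dim: int = 256, mel_dim: int = 80):
        super().__init__()
        # One fixed-size vector per known speaker ID (lookup table).
        self.speaker_table = nn.Embedding(n_speakers, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.to_mel = nn.Linear(hidden_dim, mel_dim)

    def forward(self, encoder_out: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, frames, hidden_dim); speaker_id: (batch,)
        spk = self.speaker_table(speaker_id).unsqueeze(1)   # (batch, 1, hidden_dim)
        conditioned = encoder_out + spk                     # broadcast over all frames
        hidden, _ = self.decoder(conditioned)
        return self.to_mel(hidden)                          # predicted mel-spectrogram

decoder = SpeakerConditionedDecoder(n_speakers=10)
mel = decoder(torch.randn(2, 100, 256), torch.tensor([3, 7]))  # (2, 100, 80)
```

Because the table only holds vectors for speakers seen during training, this scheme cannot generalize to unseen singers without additional data, which is the limitation the quoted statement points at.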
“…[38] learns the characteristics of each singer by adapting a pre-trained SVS model to a target singer. [36,37] present zero-shot style adaptation methods that use reference encoders to extract singer embeddings from the reference audio. In particular, [37] applies multiple reference encoders and multi-head attention to reflect singer characteristics more effectively.…”
Section: Related Work (mentioning)
confidence: 99%
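The multi-reference idea described in this statement can be sketched as follows: each reference audio is summarized into one timbre vector, and multi-head attention lets the phoneme states draw on all of them at once. This is a rough sketch of the general technique under assumed shapes and module choices, not the exact architecture of [37].

```
import torch
import torch.nn as nn

class MultiReferenceTimbreEncoder(nn.Module):
    """Sketch: pool timbre information from several reference utterances with attention."""

    def __init__(self, mel_dim: int = 80, embed_dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Each reference mel-spectrogram is summarized into one timbre vector.
        self.reference_encoder = nn.GRU(mel_dim, embed_dim, batch_first=True)
        # Phoneme (query) states attend over the per-reference timbre vectors.
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, phoneme_states: torch.Tensor, references: torch.Tensor) -> torch.Tensor:
        # phoneme_states: (batch, phones, embed_dim)
        # references:     (batch, n_refs, frames, mel_dim)
        b, n_refs, frames, mel_dim = references.shape
        _, last = self.reference_encoder(references.reshape(b * n_refs, frames, mel_dim))
        ref_embeddings = last[-1].reshape(b, n_refs, -1)         # (batch, n_refs, embed_dim)
        timbre, _ = self.attention(phoneme_states, ref_embeddings, ref_embeddings)
        return phoneme_states + timbre                           # timbre-conditioned states

enc = MultiReferenceTimbreEncoder()
out = enc(torch.randn(2, 50, 256), torch.randn(2, 3, 200, 80))  # (2, 50, 256)
```

The attention weights decide how much each reference contributes per phoneme, which is how multiple reference audios can supply more timbre information than a single fixed-size embedding.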
“…To further distinguish vowels and consonants, a duration predictor is built to produce fine-grained phoneme-level durations, trained with supervision obtained from forced alignment [6][7][8][9][10][11], heuristics [12][13][14][15], etc. The advantage of this type of feature-processing strategy is that the input phoneme and pitch sequences are strictly aligned at the note level based on the music score.…”
Section: Introduction (mentioning)
confidence: 99%
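A duration predictor of the kind described in this statement can be sketched as a small convolutional network that outputs one log-duration per phoneme and is trained against per-phoneme frame counts obtained from a forced aligner. Layer sizes, the loss, and the class name are illustrative assumptions, not the setup of any cited paper.

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class DurationPredictor(nn.Module):
    """Sketch: predict a log-scale duration (in frames) for each phoneme."""

    def __init__(self, hidden_dim: int = 256, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, phoneme_states: torch.Tensor) -> torch.Tensor:
        # phoneme_states: (batch, phones, hidden_dim)
        x = phoneme_states.transpose(1, 2)        # Conv1d expects (batch, channels, phones)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return self.proj(x.transpose(1, 2)).squeeze(-1)  # (batch, phones) log-durations

# Training target: log of per-phoneme frame counts, e.g. produced by a forced aligner.
predictor = DurationPredictor()
states = torch.randn(2, 40, 256)
target_frames = torch.randint(1, 20, (2, 40)).float()
loss = F.mse_loss(predictor(states), torch.log(target_frames))
```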