ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414348

Sequence-To-Sequence Singing Voice Synthesis With Perceptual Entropy Loss

Abstract: Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting under data scarcity. However, data limitations are common when building SVS systems because of the high cost of data acquisition and annotation. In this work, we propose a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network. With a one-hour open-source singing voice database, we explore the impact of the PE loss on v…
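For intuition, here is a minimal PyTorch sketch of how a perceptual-entropy-style term could be combined with an ordinary spectrogram reconstruction objective. The masking-threshold computation below is a deliberately crude stand-in (local spectral smoothing), not the psycho-acoustic model the paper derives its loss from, and the names `simplified_pe_term`, `lambda_pe`, and all tensor shapes are hypothetical placeholders.

import torch
import torch.nn.functional as F

def simplified_pe_term(mag_spec, eps=1e-8):
    # mag_spec: (batch, frames, freq_bins) predicted magnitude spectrogram.
    energy = mag_spec ** 2
    # Crude masking-threshold stand-in: smooth energy across frequency.
    # A real psycho-acoustic model would use Bark-band spreading and
    # tonality estimates instead of a flat moving average.
    threshold = 0.1 * F.avg_pool1d(energy, kernel_size=5, stride=1, padding=2)
    # Johnston-style perceptual entropy counts the bits needed to code
    # each bin relative to its masking threshold.
    levels = 2.0 * torch.sqrt(energy / (threshold + eps)) + 1.0
    return torch.log2(levels).mean()

def training_loss(pred_spec, target_spec, lambda_pe=0.01):
    # Ordinary reconstruction loss plus the PE term as a regularizer.
    return F.l1_loss(pred_spec, target_spec) + lambda_pe * simplified_pe_term(pred_spec)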

Cited by 11 publications (8 citation statements) | References: 33 publications

Citation statements (ordered by relevance):
“…The experiments are conducted using the music processing toolkit Muskits [28], which is adapted from ESPnet [30]. More details about the code can be found on GitHub 2.…”
Section: Experimental Settings
confidence: 99%
“…To be specific, we utilize 384-dimensional embedding layers for lyrics, notes, and note durations. The encoder and decoder of the first model are both three-layer 256-dimensional bidirectional Long Short-Term Memory (Bi-LSTM) units, following [2]. The second network utilizes a Transformer structure from [14].…”
Section: Experimental Settings
confidence: 99%
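The excerpt above pins down concrete hyper-parameters, so a minimal PyTorch sketch of the first (Bi-LSTM) configuration is easy to write down. The vocabulary sizes, the output spectrogram dimension, the class name, and the choice to sum the three embeddings (rather than concatenate them) are assumptions for illustration, not details from the cited work.

import torch
import torch.nn as nn

class BiLSTMSVS(nn.Module):
    # Sketch of the Bi-LSTM encoder-decoder described in the excerpt:
    # 384-dim embeddings for lyrics / notes / note durations, and
    # three-layer, 256-dim bidirectional LSTMs on both sides.
    def __init__(self, n_lyrics=100, n_notes=128, n_durs=64, n_mels=80):
        super().__init__()
        self.lyric_emb = nn.Embedding(n_lyrics, 384)
        self.note_emb = nn.Embedding(n_notes, 384)
        self.dur_emb = nn.Embedding(n_durs, 384)
        self.encoder = nn.LSTM(384, 256, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(512, 256, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.proj = nn.Linear(512, n_mels)

    def forward(self, lyrics, notes, durs):
        # Sum the three 384-dim embeddings into one frame-level input.
        x = self.lyric_emb(lyrics) + self.note_emb(notes) + self.dur_emb(durs)
        enc, _ = self.encoder(x)   # (batch, T, 512)
        dec, _ = self.decoder(enc) # (batch, T, 512)
        return self.proj(dec)      # predicted spectrogram frames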
“…Ren et al. [28] used a forward Transformer-based network to perform end-to-end singing voice synthesis, directly generating a linear spectrum and then obtaining the singing voice through the Griffin-Lim vocoder. Shi et al. [29] combined the perceptual entropy loss function with mainstream sequence models, including RNN, Transformer, and Conformer, for singing voice synthesis. Xue et al. [30] used an acoustic model with an encoder-decoder architecture to perform end-to-end training on frame-level input.…”
Section: Introduction
confidence: 99%
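Since the excerpt mentions recovering a waveform from a predicted linear spectrum with Griffin-Lim, here is a small illustrative snippet using librosa's implementation. The STFT parameters and the synthetic stand-in for the predicted magnitudes are arbitrary placeholders, not values from the cited systems.

import numpy as np
import librosa

# Suppose `pred_mag` is a predicted linear magnitude spectrogram of
# shape (1 + n_fft // 2, n_frames), as an SVS acoustic model would emit.
# Here we fabricate one from a test tone purely to make the snippet run.
n_fft, hop = 1024, 256
pred_mag = np.abs(librosa.stft(librosa.tone(440, sr=22050, duration=1.0),
                               n_fft=n_fft, hop_length=hop))

# Griffin-Lim iteratively estimates a phase consistent with the given
# magnitudes, then inverts the STFT to obtain a waveform.
wav = librosa.griffinlim(pred_mag, n_iter=60, hop_length=hop,
                         win_length=n_fft)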