Interspeech 2019
DOI: 10.21437/interspeech.2019-1722

Adversarially Trained End-to-End Korean Singing Voice Synthesis System

Abstract: In this paper, we propose an end-to-end Korean singing voice synthesis system from lyrics and a symbolic melody using the following three novel approaches: 1) phonetic enhancement masking, 2) local conditioning of text and pitch to the super-resolution network, and 3) conditional adversarial training. The proposed system consists of two main modules: a mel-synthesis network that generates a mel-spectrogram from the given input information, and a super-resolution network that upsamples the generated mel-spectrogram…
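The two-module pipeline described in the abstract can be sketched as follows. Note this is an illustrative sketch, not the paper's implementation: the 80-bin mel resolution, the 513-bin linear-spectrogram dimension, the conditioning dimension, and the random linear maps standing in for the trained networks are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not values from the paper).
N_FRAMES, N_MEL, N_LINEAR = 100, 80, 513
COND_DIM = 16  # joint frame-level embedding of lyrics (text) and symbolic melody (pitch)

# Stand-ins for the two trained modules: plain random linear maps.
mel_synthesis = rng.standard_normal((COND_DIM, N_MEL))     # stage 1: conditions -> mel
super_resolution = rng.standard_normal((N_MEL, N_LINEAR))  # stage 2: mel -> linear

def synthesize(conditions: np.ndarray) -> np.ndarray:
    """Two-stage generation: conditions -> mel-spectrogram -> linear spectrogram.

    The paper additionally conditions the super-resolution stage locally on
    text and pitch; that path is omitted from this minimal sketch.
    """
    mel = conditions @ mel_synthesis       # mel-synthesis network
    linear = mel @ super_resolution        # super-resolution network
    return linear

conditions = rng.standard_normal((N_FRAMES, COND_DIM))
linear_spec = synthesize(conditions)
```

The point of the sketch is the data flow: a compact frame-aligned conditioning sequence is first rendered as a mel-spectrogram, which a second network then upsamples along the frequency axis to a linear spectrogram.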

Cited by 73 publications (96 citation statements)
References 20 publications (30 reference statements)
“…In recent years, several kinds of DNN-based singing voice synthesis systems [4,17,18,19,20] have been proposed. In the training part of the basic system [4], parameters for spectrum (e.g., mel-cepstral coefficients), excitation, and vibrato are extracted from a singing voice database as acoustic features.…”
Section: DNN-based Singing Voice Synthesis
confidence: 99%
“…We propose a multi-singer SVS system that can model timbre and singing styles independently. We designed the network with [8] as the baseline and extended the existing model to a multi-singer model by adding 1) a singer identity encoder and 2) a timbre/singing-style conditioning method. As shown in Fig.…”
Section: Proposed System
confidence: 99%
“…Finally, to create a linear spectrogram that is more realistic, we applied adversarial training and added a discriminator to this end. Please refer to [8] for more detailed information on each module of the network. The summary of the generation process of the entire network is as follows:…”
Section: Proposed System
confidence: 99%
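The adversarial training mentioned in the quote above can be illustrated with a toy conditional GAN objective. The least-squares (LSGAN) loss and the conditioning-by-concatenation discriminator used here are assumptions for illustration; the excerpt does not specify the paper's exact loss or conditioning scheme.

```python
import numpy as np

rng = np.random.default_rng(1)

def lsgan_losses(d_real: np.ndarray, d_fake: np.ndarray):
    """Least-squares GAN losses: the discriminator pushes real scores
    toward 1 and fake scores toward 0; the generator pushes fake toward 1."""
    d_loss = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    g_loss = 0.5 * np.mean((d_fake - 1.0) ** 2)
    return d_loss, g_loss

def discriminator(spec: np.ndarray, cond: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Toy conditional discriminator: scores a spectrogram frame given its
    conditioning features by concatenating them along the feature axis."""
    return np.concatenate([spec, cond], axis=-1) @ w

# Illustrative dimensions: 513-bin linear spectrogram, 16-dim conditions.
n, spec_dim, cond_dim = 8, 513, 16
w = rng.standard_normal((spec_dim + cond_dim, 1))
real = rng.standard_normal((n, spec_dim))   # ground-truth linear spectrogram
fake = rng.standard_normal((n, spec_dim))   # super-resolution network output
cond = rng.standard_normal((n, cond_dim))   # shared text/pitch conditions

d_loss, g_loss = lsgan_losses(discriminator(real, cond, w),
                              discriminator(fake, cond, w))
```

Conditioning the discriminator on the same text/pitch features as the generator is what makes the training *conditional*: the discriminator judges not just whether a spectrogram looks real, but whether it is plausible for those inputs.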
“…By accounting for melodic information such as pitch and rhythm, expressive speech synthesis with Mellotron can be easily extended to singing voice synthesis (SVS) [3,4]. Unfortunately, recent attempts [4] require a singing voice dataset and heavily quantized pitch and rhythm data obtained from a digital representation of a music score, for example MIDI [5] or MusicXML [6]. Mellotron requires neither any singing voice in the dataset nor manually aligned pitch and text in order to synthesize singing voice.…”
Section: Introduction
confidence: 99%