Interspeech 2019
DOI: 10.21437/interspeech.2019-1722

Adversarially Trained End-to-End Korean Singing Voice Synthesis System

Abstract: In this paper, we propose an end-to-end Korean singing voice synthesis system from lyrics and a symbolic melody using the following three novel approaches: 1) phonetic enhancement masking, 2) local conditioning of text and pitch to the super-resolution network, and 3) conditional adversarial training. The proposed system consists of two main modules: a mel-synthesis network that generates a mel-spectrogram from the given input information, and a super-resolution network that upsamples the generated mel-spectrogram…
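The two-module pipeline described in the abstract can be sketched as follows. Note this is an illustrative sketch, not the paper's implementation: the 80-bin mel resolution, the 513-bin linear-spectrogram dimension, the conditioning dimension, and the random linear maps standing in for the trained networks are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not values from the paper).
N_FRAMES, N_MEL, N_LINEAR = 100, 80, 513
COND_DIM = 16  # joint frame-level embedding of lyrics (text) and symbolic melody (pitch)

# Stand-ins for the two trained modules: plain random linear maps.
mel_synthesis = rng.standard_normal((COND_DIM, N_MEL))     # stage 1: conditions -> mel
super_resolution = rng.standard_normal((N_MEL, N_LINEAR))  # stage 2: mel -> linear

def synthesize(conditions: np.ndarray) -> np.ndarray:
    """Two-stage generation: conditions -> mel-spectrogram -> linear spectrogram.

    The paper additionally conditions the super-resolution stage locally on
    text and pitch; that path is omitted from this minimal sketch.
    """
    mel = conditions @ mel_synthesis       # mel-synthesis network
    linear = mel @ super_resolution        # super-resolution network
    return linear

conditions = rng.standard_normal((N_FRAMES, COND_DIM))
linear_spec = synthesize(conditions)
```

The point of the sketch is the data flow: a compact frame-aligned conditioning sequence is first rendered as a mel-spectrogram, which a second network then upsamples along the frequency axis to a linear spectrogram.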

Cited by 73 publications (96 citation statements)
References 20 publications (30 reference statements)
“…In recent years, several kinds of DNN-based singing voice synthesis systems [4,17,18,19,20] have been proposed. In the training part of the basic system [4], parameters for spectrum (e.g., mel-cepstral coefficients), excitation, and vibrato are extracted from a singing voice database as acoustic features.…”
Section: DNN-based Singing Voice Synthesis
confidence: 99%
“…We propose a multi-singer SVS system that can model timbre and singing styles independently. We designed the network with [8] as the baseline and extended the existing model to a multi-singer model by adding 1) a singer identity encoder and 2) a timbre/singing-style conditioning method. As shown in Fig.…”
Section: Proposed System
confidence: 99%
“…Finally, to create a linear spectrogram that is more realistic, we applied adversarial training and added a discriminator to this end. Please refer to [8] for more detailed information on each module of the network. The summary of the generation process of the entire network is as follows:…”
Section: Proposed System
confidence: 99%
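The adversarial training mentioned in the quote above can be illustrated with a toy conditional GAN objective. The least-squares (LSGAN) loss and the conditioning-by-concatenation discriminator used here are assumptions for illustration; the excerpt does not specify the paper's exact loss or conditioning scheme.

```python
import numpy as np

rng = np.random.default_rng(1)

def lsgan_losses(d_real: np.ndarray, d_fake: np.ndarray):
    """Least-squares GAN losses: the discriminator pushes real scores
    toward 1 and fake scores toward 0; the generator pushes fake toward 1."""
    d_loss = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    g_loss = 0.5 * np.mean((d_fake - 1.0) ** 2)
    return d_loss, g_loss

def discriminator(spec: np.ndarray, cond: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Toy conditional discriminator: scores a spectrogram frame given its
    conditioning features by concatenating them along the feature axis."""
    return np.concatenate([spec, cond], axis=-1) @ w

# Illustrative dimensions: 513-bin linear spectrogram, 16-dim conditions.
n, spec_dim, cond_dim = 8, 513, 16
w = rng.standard_normal((spec_dim + cond_dim, 1))
real = rng.standard_normal((n, spec_dim))   # ground-truth linear spectrogram
fake = rng.standard_normal((n, spec_dim))   # super-resolution network output
cond = rng.standard_normal((n, cond_dim))   # shared text/pitch conditions

d_loss, g_loss = lsgan_losses(discriminator(real, cond, w),
                              discriminator(fake, cond, w))
```

Conditioning the discriminator on the same text/pitch features as the generator is what makes the training *conditional*: the discriminator judges not just whether a spectrogram looks real, but whether it is plausible for those inputs.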
“…By accounting for melodic information such as pitch and rhythm, expressive speech synthesis with Mellotron can be easily extended to singing voice synthesis (SVS) [3,4]. Unfortunately, recent attempts [4] require a singing voice dataset and heavily quantized pitch and rhythm data obtained from a digital representation of a music score, for example MIDI [5] or MusicXML [6]. Mellotron requires neither any singing voice in the dataset nor manually aligned pitch and text in order to synthesize singing voice.…”
Section: Introduction
confidence: 99%