2019
DOI: 10.48550/arxiv.1908.01919
Preprint

Adversarially Trained End-to-end Korean Singing Voice Synthesis System

Cited by 8 publications (22 citation statements)
References 0 publications
“…For singing voice conversion, [13] adapted AutoVC by conditioning the network on pitch contours transposed to a suitable register for the converted singing, achievable through the implementation of a vocoder. [8] utilised a Wasserstein-GAN framework, using a decoder for pitch contours and another for generating 'formant masks'. The product of these two decoders is the estimated mel-spectrogram for singing.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
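The two-decoder construction described in this statement comes down to an elementwise product of decoder outputs. A minimal PyTorch sketch of that idea follows; all module names, layer choices, and shapes are illustrative assumptions, not the cited paper's actual architecture:

```python
# Hypothetical sketch of the two-decoder mel estimate described above: one
# decoder predicts a pitch-contour spectrogram, the other a "formant mask",
# and their elementwise product is the estimated mel-spectrogram.
import torch
import torch.nn as nn

class TwoDecoderMelEstimator(nn.Module):
    def __init__(self, cond_dim=256, n_mels=80):
        super().__init__()
        # Decoder 1: conditioning features -> pitch-contour spectrogram.
        self.pitch_decoder = nn.Sequential(
            nn.Linear(cond_dim, 512), nn.ReLU(), nn.Linear(512, n_mels)
        )
        # Decoder 2: conditioning features -> formant mask in [0, 1].
        self.formant_decoder = nn.Sequential(
            nn.Linear(cond_dim, 512), nn.ReLU(), nn.Linear(512, n_mels),
            nn.Sigmoid(),
        )

    def forward(self, cond):  # cond: (batch, time, cond_dim)
        harmonics = self.pitch_decoder(cond)   # (batch, time, n_mels)
        mask = self.formant_decoder(cond)      # (batch, time, n_mels)
        return harmonics * mask                # estimated mel-spectrogram
```

In the Wasserstein-GAN framework the statement describes, this product would be the generator output scored by a critic; the sketch shows only the factorization itself.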
“…One approach to dealing with this lack of labels for underlying non-textual information is to look for hand-engineered statistics, based on the audio, that we believe are correlated with this underlying information. This is the approach taken by models like (Nishimura et al., 2016; Lee et al., 2019), wherein utterances are conditioned on audio statistics that can be calculated directly from the training data, such as F0 (fundamental frequency). However, in order to use such models, the statistics we hope to approximate must be decided upon a priori, and the target value of these statistics must be determined before synthesis.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
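The hand-engineered statistic this statement names, F0, can be computed directly from audio. A minimal sketch using librosa's pYIN pitch tracker; the file name and frequency range are assumptions for illustration, not taken from the cited work:

```python
# Compute an F0 (fundamental frequency) contour as a conditioning statistic.
import librosa
import numpy as np

y, sr = librosa.load("singing_sample.wav", sr=22050)  # hypothetical file
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz lower bound (assumed range)
    fmax=librosa.note_to_hz("C6"),  # ~1047 Hz upper bound
    sr=sr,
)
# Unvoiced frames come back as NaN; a model conditioned on F0 typically
# needs them filled (e.g., with zeros) before training.
f0 = np.nan_to_num(f0, nan=0.0)
print("frames:", len(f0), "voiced fraction:", float(voiced_flag.mean()))
```

As the statement notes, the statistic must be chosen a priori, and a target contour like this must be supplied at synthesis time.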
“…Singing voice synthesis (SVS) aims to synthesize high-quality and expressive singing voices based on musical score information, and attracts a lot of attention in both industry and academia (especially in the machine learning and speech signal processing communities) (Umbert et al., 2015; Nishimura et al., 2016; Blaauw & Bonada, 2017; Nakamura et al., 2019; Hono et al., 2019; Chandna et al., 2019; Lee et al., 2019; Lu et al., 2020; Blaauw & Bonada, 2020; Gu et al., 2020; Ren et al., 2020b). Singing voice synthesis shares a similar pipeline with text-to-speech synthesis, and has achieved rapid progress (Blaauw & Bonada, 2017; Nakamura et al., 2019; Lee et al., 2019; Blaauw & Bonada, 2020; Gu et al., 2020) with the techniques developed in text-to-speech synthesis (Shen et al., 2018; Ren et al., 2019; 2020a).…”
Section: Introduction
Citation type: mentioning
confidence: 99%
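To make the shared SVS/TTS pipeline mentioned in this statement concrete, here is a schematic score-to-mel sketch: phoneme, note-pitch, and duration inputs are embedded, encoded, and projected to a mel-spectrogram, which a vocoder would then turn into a waveform. Every size and module choice below is an illustrative assumption, not any cited system's design:

```python
# Schematic score-to-mel model mirroring the TTS-style SVS pipeline.
import torch
import torch.nn as nn

class ScoreToMel(nn.Module):
    def __init__(self, n_phonemes=80, n_pitches=128, d_model=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.pitch_emb = nn.Embedding(n_pitches, d_model)  # MIDI note numbers
        self.dur_proj = nn.Linear(1, d_model)              # duration in frames
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phonemes, pitches, durations):
        # phonemes, pitches: (batch, time) ints; durations: (batch, time, 1)
        x = (self.phoneme_emb(phonemes) + self.pitch_emb(pitches)
             + self.dur_proj(durations))
        return self.to_mel(self.encoder(x))  # (batch, time, n_mels)
```

The analogy to TTS is that the score encoder stands in for the text encoder; the decoder-side techniques cited above (e.g., Tacotron- or FastSpeech-style models) carry over largely unchanged.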
“…Most previous works on SVS (Lee et al., 2019; Gu et al., 2020) adopt the same sampling rate (e.g., 16 kHz or 24 kHz) as used in text-to-speech, where the frequency bands or sampled data points are not enough to convey expression and emotion as in high-fidelity singing voices. However, simply increasing the sampling rate will cause several challenges in singing modeling.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
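The trade-off behind this sampling-rate point is simple arithmetic: the Nyquist frequency sr/2 bounds the representable band, while the number of samples the model must generate per second grows linearly with sr. A small illustration:

```python
# Nyquist band vs. per-second sample count at common SVS sampling rates.
for sr in (16_000, 24_000, 48_000):
    nyquist_khz = sr / 2 / 1000
    print(f"{sr} Hz: band up to {nyquist_khz:.0f} kHz, {sr} samples/second")
```

So moving from 24 kHz to 48 kHz doubles both the audible band covered and the amount of data the model must predict, which is the modeling challenge the statement alludes to.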