Adversarially Trained End-to-end Korean Singing Voice Synthesis System

Lee, Juheon; Choi, Hyeong-Seok; Jeon, Chang-Bin; Koo, Junghyun; Lee, Kyogu

doi:10.48550/arxiv.1908.01919

Cited by 8 publications

(22 citation statements)

References 0 publications

“…For singing voice conversion, [13] adapted AutoVC by conditioning the network on pitch contours transposed to a suitable register for the converted singing, achievable through the implementation of a vocoder. [8] utilised a Wasserstein-GAN framework, using a decoder for pitch contours and another for generating 'formant masks'. The product of these two decoders is the estimated mel-spectrogram for singing.…”

Section: Related Work (mentioning)

confidence: 99%

Zero-shot Singing Technique Conversion

O’Connor, Dixon, Fazekas. 2021. Preprint.

In this paper we propose modifications to the neural network framework, AutoVC [17] for the task of singing technique conversion. This includes utilising a pretrained singing technique encoder which extracts technique information, upon which a decoder is conditioned during training. By swapping out a source singer's technique information for that of the target's during conversion, the input spectrogram is reconstructed with the target's technique. We document the beneficial effects of omitting the latent loss, the importance of sequential training, and our process for fine-tuning the bottleneck. We also conducted a listening study where participants rate the specificity of technique-converted voices as well as their naturalness. From this we are able to conclude how effective the technique conversions are and how different conditions affect them, while assessing the model's ability to reconstruct its input data.

“…One approach to dealing with this lack of labels for underlying non-textual information is to look for hand engineered statistics based on the audio that we believe are correlated with this underlying information. This is the approach taken by models like (Nishimura et al, 2016; Lee et al, 2019), wherein utterances are conditioned on audio statistics that can be calculated directly from the training data such as F0 (fundamental frequency). However, in order to use such models, the statistics we hope to approximate must be decided upon a-priori, and the target value of these statistics must be determined before synthesis.…”

Section: Related Work (mentioning)

confidence: 99%

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

Valle, Shih, Prenger et al. 2020. Preprint.

In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows insights from IAF and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be manipulated to control many aspects of speech synthesis (pitch, tone, speech rate, cadence, accent). Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. In addition, we provide results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training. Code and pretrained models will be made publicly available at https://github.com/NVIDIA/flowtron.

“…Singing voice synthesis (SVS) aims to synthesize high-quality and expressive singing voices based on musical score information, and attracts a lot of attention in both industry and academia (especially in the machine learning and speech signal processing community) (Umbert et al, 2015; Nishimura et al, 2016; Blaauw & Bonada, 2017; Nakamura et al, 2019; Hono et al, 2019; Chandna et al, 2019; Lee et al, 2019; Lu et al, 2020; Blaauw & Bonada, 2020; Gu et al, 2020; Ren et al, 2020b). Singing voice synthesis shares similar pipeline with text to speech synthesis, and has achieved rapid progress (Blaauw & Bonada, 2017; Nakamura et al, 2019; Lee et al, 2019; Blaauw & Bonada, 2020; Gu et al, 2020) with the techniques developed in text to speech synthesis (Shen et al, 2018; Ren et al, 2019; 2020a;.…”

Section: Introduction (mentioning)

confidence: 99%

“…Most previous works on SVS (Lee et al, 2019; Gu et al, 2020) adopt the same sampling rate (e.g., 16kHz or 24kHz) as used in text to speech, where the frequency bands or sampling data points are not enough to convey expression and emotion as in high-fidelity singing voices. However, simply increasing the sampling rate will cause several challenges in singing modeling.…”

Section: Introduction (mentioning)

confidence: 99%

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

Chen, Tan, Luan et al. 2020. Preprint.

High-fidelity singing voices usually require higher sampling rate (e.g., 48kHz, compared with 16kHz or 24kHz in speaking voices) with large range of frequency to convey expression and emotion. However, higher sampling rate causes the wider frequency band and longer waveform sequences and throws challenges for singing modeling in both frequency and time domains in singing voice synthesis (SVS). Conventional SVS systems that adopt moderate sampling rate (e.g., 16kHz or 24kHz) cannot well address the above challenges. In this paper, we develop HiFiSinger, an SVS system towards high-fidelity singing voice using 48kHz sampling rate. HiFiSinger consists of a FastSpeech based neural acoustic model and a Parallel WaveGAN based neural vocoder to ensure fast training and inference and also high voice quality. To tackle the difficulty of singing modeling caused by high sampling rate (wider frequency band and longer waveform), we introduce multi-scale adversarial training in both the acoustic model and vocoder to improve singing modeling. Specifically, 1) To handle the larger range of frequencies caused by higher sampling rate (e.g., 48kHz vs. 24kHz), we propose a novel sub-frequency GAN (SF-GAN) on mel-spectrogram generation, which splits the full 80-dimensional mel-frequency into multiple sub-bands (e.g. low, middle and high frequency bands) and models each sub-band with a separate discriminator. 2) To model longer waveform sequences caused by higher sampling rate, we propose a multi-length GAN (ML-GAN) for waveform generation to model different lengths of waveform sequences with separate discriminators. 3) We also introduce several additional designs and findings in HiFiSinger that are crucial for high-fidelity voices, such as adding F0 (pitch) and V/UV (voiced/unvoiced flag) as acoustic features, choosing an appropriate window/hop size for mel-spectrogram, and increasing the receptive field in vocoder for long vowel modeling in singing voices. 
Experiment results show that HiFiSinger synthesizes high-fidelity singing voices with much higher quality: 0.32/0.44 MOS gain over 48kHz/24kHz baseline and 0.83 MOS gain over previous SVS systems. Audio samples are available at https://speechresearch.github.io/hifisinger/.
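The sub-band split described in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the HiFiSinger implementation: the function name `split_mel_subbands` and the band edges are hypothetical, since the abstract says only that the 80-dimensional mel-frequency axis is split into several sub-bands (e.g. low, middle and high), each judged by its own discriminator. The discriminators themselves are omitted.

```python
import numpy as np

# Hypothetical band edges (start, stop) over the 80 mel bins.
# The abstract does not give the actual boundaries used.
ASSUMED_BAND_EDGES = ((0, 40), (20, 60), (40, 80))


def split_mel_subbands(mel, edges=ASSUMED_BAND_EDGES):
    """Slice a (frames, n_mels) mel-spectrogram into frequency sub-bands.

    Each returned array would be passed to a separate discriminator
    in a sub-frequency GAN style setup.
    """
    return [mel[:, lo:hi] for lo, hi in edges]


# Dummy mel-spectrogram: 100 frames, 80 mel bins.
mel = np.random.default_rng(0).standard_normal((100, 80))
low, mid, high = split_mel_subbands(mel)
```

Each slice keeps the full time axis and covers only part of the frequency axis, so a discriminator attached to one slice sees a narrower frequency range than a single full-band discriminator would.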
