WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN

Chandna, Pritish; Blaauw, Merlijn; Bonada, Jordi; Gomez, Emilia

doi:10.48550/arxiv.1903.10729

Cited by 5 publications

(9 citation statements)

References 0 publications


“…Neural networks have been used previously to some success in modeling pre-extracted synthesis parameters (Blaauw & Bonada, 2017; Chandna et al., 2019), but these models fall short of end-to-end learning. The analysis parameters must still be tuned by hand and gradients cannot flow through the synthesis procedure.…”

Section: Oscillator Models

confidence: 99%

DDSP: Differentiable Digital Signal Processing

Engel, Hantrakul, et al. 2020

Preprint


Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation-based machine learning methods. In this paper, we introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods. Focusing on audio synthesis, we achieve high-fidelity generation without the need for large autoregressive models or adversarial losses, demonstrating that DDSP enables utilizing strong inductive biases without losing the expressive power of neural networks. Further, we show that combining interpretable modules permits manipulation of each separate model component, with applications such as independent control of pitch and loudness, realistic extrapolation to pitches not seen during training, blind dereverberation of room acoustics, transfer of extracted room acoustics to new environments, and transformation of timbre between disparate sources. In short, DDSP enables an interpretable and modular approach to generative modeling, without sacrificing the benefits of deep learning. The library is publicly available and we welcome further contributions from the community and domain experts.


“…Researches to extend the SVS system to the multi-singer system has been conducted relatively recently. [4] proposes a method of expressing each singer's identity by one-hot embedding. This method is straightforward and simple, but has the limitation of requiring re-training each time to add a new singer.…”

Section: Multi-singer SVS System

confidence: 99%

“…The multi-singer SVS system should not only produce natural pronunciation and pitch contour but also suitably reflect the identity of a particular singer. To achieve this, methods for adding conditional inputs reflecting the singer's identity to the network have been proposed [4,5].…”

Section: Introduction

confidence: 99%

Disentangling Timbre and Singing Style with Multi-Singer Singing Synthesis System

Lee, Choi, Junghyun, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


In this study, we define the identity of the singer with two independent concepts, timbre and singing style, and propose a multi-singer singing synthesis system that can model them separately. To this end, we extend our single-singer model into a multi-singer model in the following ways: first, we design a singer identity encoder that can adequately reflect the identity of a singer. Second, we use encoded singer identity to condition the two independent decoders that model timbre and singing style, respectively. Through a user study with the listening tests, we experimentally verify that the proposed framework is capable of generating a natural singing voice of high quality while independently controlling the timbre and singing style. Also, by using the method of changing singing styles while fixing the timbre, we suggest that our proposed network can produce a more expressive singing voice.


“…Although there are few works focusing on the synthesis of Peking Opera, or more broadly, opera, the synthesis of singing voice has been researched since 1962 when Kelly and Lochbaum [1] used an acoustic tube model to synthesis singing voice with success. Recently, several works [2][3][4][5][6][7] use deep neural networks to synthesis singing voice which, known as parametric systems, process fundamental frequency (or pitch contour, f0) and harmonics features (or timbre) separately. As a typical case among such systems, Neural Parametric Singing Synthesizer (NPSS) [2] using a phoneme timing model, a pitch model and a timbre model each consist a set of neural networks * Yusong Wu performed the work while at Tencent.…”

Section: Introduction

confidence: 99%

Peking Opera Synthesis via Duration Informed Attention Network

Wu, Li, Yu, et al. 2020

Preprint


Peking Opera has been the most dominant form of Chinese performing art since around 200 years ago. A Peking Opera singer usually exhibits a very strong personal style via introducing improvisation and expressiveness on stage which leads the actual rhythm and pitch contour to deviate significantly from the original music score. This inconsistency poses a great challenge in Peking Opera singing voice synthesis from a music score. In this work, we propose to deal with this issue and synthesize expressive Peking Opera singing from the music score based on the Duration Informed Attention Network (DurIAN) framework. To tackle the rhythm mismatch, Lagrange multiplier is used to find the optimal output phoneme duration sequence with the constraint of the given note duration from music score. As for the pitch contour mismatch, instead of directly inferring from music score, we adopt a pseudo music score generated from the real singing and feed it as input during training. The experiments demonstrate that with the proposed system we can synthesize Peking Opera singing voice with high-quality timbre, pitch and expressiveness.
