WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN

Chandna, Pankaj; Blaauw, Merlijn; Bonada, Jordi; Gómez, Emilia

doi:10.23919/eusipco.2019.8903099

Cited by 51 publications

(50 citation statements)

References 13 publications

“…Non-Seq2Seq singing synthesizers include those based on autoregressive architectures [17,21,22], feed-forward CNN [23], and feed-forward GAN-based approaches [24,25].…”

Section: Relation To Prior Work (mentioning)

confidence: 99%

Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer

Blaauw, Bonada (2020)

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

We propose a sequence-to-sequence singing synthesizer, which avoids the need for training data with pre-aligned phonetic and acoustic features. Rather than the more common approach of a content-based attention mechanism combined with an autoregressive decoder, we use a different mechanism suitable for feed-forward synthesis. Given that phonetic timings in singing are highly constrained by the musical score, we derive an approximate initial alignment with the help of a simple duration model. Then, using a decoder based on a feed-forward variant of the Transformer model, a series of self-attention and convolutional layers refines the result of the initial alignment to reach the target acoustic features. Advantages of this approach include faster inference and avoiding the exposure bias issues that affect autoregressive models trained by teacher forcing. We evaluate the effectiveness of this model compared to an autoregressive baseline, the importance of self-attention, and the importance of the accuracy of the duration model.

“…Researches to extend the SVS system to the multi-singer system has been conducted relatively recently. [4] proposes a method of expressing each singer's identity by one-hot embedding. This method is straightforward and simple, but has the limitation of requiring re-training each time to add a new singer.…”

Section: Multi-singer SVS System (mentioning)

confidence: 99%

“…The multi-singer SVS system should not only produce natural pronunciation and pitch contour but also suitably reflect the identity of a particular singer. To achieve this, methods for adding conditional inputs reflecting the singer's identity to the network have been proposed [4,5].…”

Section: Introduction (mentioning)

confidence: 99%

Disentangling Timbre and Singing Style with Multi-Singer Singing Synthesis System

Lee, Choi, Junghyun, et al. (2020)

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this study, we define the identity of the singer with two independent concepts, timbre and singing style, and propose a multi-singer singing synthesis system that can model them separately. To this end, we extend our single-singer model into a multi-singer model in the following ways: first, we design a singer identity encoder that can adequately reflect the identity of a singer. Second, we use encoded singer identity to condition the two independent decoders that model timbre and singing style, respectively. Through a user study with the listening tests, we experimentally verify that the proposed framework is capable of generating a natural singing voice of high quality while independently controlling the timbre and singing style. Also, by using the method of changing singing styles while fixing the timbre, we suggest that our proposed network can produce a more expressive singing voice.

“…Previous works on SVS include lyrics-to-singing alignment [6,10,12], parametric synthesis [1,19], acoustic modeling [24,27,29], and adversarial synthesis [5,15,21]. Although they achieve reasonably good performance, these systems typically require 1) a large amount of high-quality singing recordings as training data, and 2) strict data alignments between lyrics and singing audio for accurate singing modeling, both of which incur considerable data labeling cost.…”

Section: Introduction (mentioning)

confidence: 99%

“…Singing Voice Synthesis. Previous works have conducted studies on SVS from different aspects, including lyrics-to-singing alignment [6,10,12], parametric synthesis [1,19], acoustic modeling [27,29], and adversarial synthesis [5,15,21]. Blaauw and Bonada [1] leverage the WaveNet architecture and separates the influence of pitch and timbre for parametric singing synthesis.…”

Section: Introduction (mentioning)

confidence: 99%

DeepSinger: Singing Voice Synthesis with Data Mined From the Web

Ren, Tan, Qin, et al. (2020)

Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

In this paper, we develop DeepSinger, a multilingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model to automatically extract the duration of each phoneme in lyrics starting from coarse-grained sentence level to fine-grained phoneme level, and further design a multilingual multi-singer singing model based on a feed-forward Transformer to directly generate linear-spectrograms from lyrics, and synthesize voices using Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 2) the lyrics-to-singing alignment model further avoids any human efforts for alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, by removing the complicated acoustic feature modeling in parametric synthesis and leveraging a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and multiple singers. We evaluate DeepSinger on our mined singing dataset that consists of about 92 hours data from 89 singers on three languages (Chinese, Cantonese and English). The results demonstrate that with the singing data purely mined from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness. CCS Concepts: Computing methodologies → Natural language processing; Applied computing → Sound and music computing.
