Abstract: This paper presents a method of using autoregressive neural networks for the acoustic modeling of singing voice synthesis (SVS). Singing voice differs from speech in that it contains more local dynamic movements of acoustic features, e.g., vibratos. Our method therefore adopts deep autoregressive (DAR) models to predict the F0 and spectral features of the singing voice, in order to better describe the dependencies among the acoustic features of consecutive frames. For F0 modeling, discretized F0 values are used and th…
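To make the DAR idea concrete, here is a minimal sketch of autoregressive F0 modeling with discretized F0 values: a recurrent network predicts a categorical distribution over F0 bins for each frame, conditioned on linguistic features and the previous frame's F0 bin. The architecture, layer sizes, and feature dimensions below are illustrative assumptions, not the paper's exact model.

```python
# Toy deep autoregressive (DAR) F0 model over discretized F0 bins.
# All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ToyDARF0(nn.Module):
    def __init__(self, n_ling=60, n_bins=256, d_emb=64, d_hid=256):
        super().__init__()
        self.prev_f0 = nn.Embedding(n_bins, d_emb)   # previous frame's F0 bin
        self.rnn = nn.LSTM(n_ling + d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, n_bins)          # logits over F0 bins

    def forward(self, ling, prev_bins):
        # ling: (B, T, n_ling) linguistic features; prev_bins: (B, T) int64
        x = torch.cat([ling, self.prev_f0(prev_bins)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                           # (B, T, n_bins)

# Training would use teacher forcing with a cross-entropy loss, e.g.:
# loss = nn.functional.cross_entropy(logits.transpose(1, 2), target_bins)
```

Treating F0 as a classification problem over discrete bins lets the model represent multimodal pitch distributions, which helps with rapid local movements such as vibrato.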
“…4) Recent Progress on Neural Vocoders: More recently, the speaker-independent WaveRNN-based neural vocoder [63] became popular, as it can generate human-like voices from both in-domain and out-of-domain spectrograms [101]–[103]. Another well-known neural vocoder that achieves high-quality synthesis performance is WaveGlow [64].…”
Section: A. Speech Analysis and Reconstruction
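For context on how such vocoders operate, the sketch below shows the core of WaveRNN-style autoregressive sampling: each waveform sample is drawn from a categorical distribution conditioned on (upsampled) spectrogram features and the previously generated sample. This is a toy illustration under assumed sizes, not the published WaveRNN or WaveGlow architectures (WaveGlow, in particular, is flow-based and non-autoregressive).

```python
# Toy WaveRNN-style sampling loop; names and sizes are assumptions.
import torch
import torch.nn as nn

class ToyWaveRNN(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_classes=256):
        super().__init__()
        self.embed = nn.Embedding(n_classes, 64)    # previous sample (e.g., 8-bit mu-law)
        self.rnn = nn.GRUCell(64 + n_mels, hidden)  # conditioned on a spectrogram frame
        self.out = nn.Linear(hidden, n_classes)     # logits over quantized amplitudes

    @torch.no_grad()
    def generate(self, mel):
        # mel: (T, n_mels), already upsampled to one frame per output sample
        h = torch.zeros(1, self.rnn.hidden_size)
        sample = torch.zeros(1, dtype=torch.long)   # start from silence
        out = []
        for t in range(mel.size(0)):
            x = torch.cat([self.embed(sample), mel[t:t + 1]], dim=-1)
            h = self.rnn(x, h)
            probs = self.out(h).softmax(dim=-1)
            sample = torch.multinomial(probs, 1).squeeze(1)
            out.append(sample)
        return torch.stack(out, dim=1)              # (1, T) quantized waveform
```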
“…Non-Seq2Seq singing synthesizers include those based on autoregressive architectures [17,21,22], feed-forward CNNs [23], and feed-forward GAN-based approaches [24,25].…”
We propose a sequence-to-sequence singing synthesizer that avoids the need for training data with pre-aligned phonetic and acoustic features. Rather than the more common approach of a content-based attention mechanism combined with an autoregressive decoder, we use a mechanism suitable for feed-forward synthesis. Since phonetic timings in singing are highly constrained by the musical score, we derive an approximate initial alignment with the help of a simple duration model. A decoder based on a feed-forward variant of the Transformer model, built from a series of self-attention and convolutional layers, then refines this initial alignment into the target acoustic features. Advantages of this approach include faster inference and avoiding the exposure bias issues that affect autoregressive models trained by teacher forcing. We evaluate the effectiveness of this model against an autoregressive baseline, and assess the importance of self-attention and of the duration model's accuracy.
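A minimal sketch of the two-stage decoding described above, with illustrative names and sizes: phoneme encodings are first expanded according to per-phoneme durations (the approximate initial alignment), then refined by self-attention layers. For simplicity, this sketch uses PyTorch's standard Transformer encoder layer, whose position-wise feed-forward sublayer stands in for the paper's convolutional layers.

```python
# Duration-based expansion followed by feed-forward Transformer refinement.
# Architecture details here are assumptions, not the paper's exact model.
import torch
import torch.nn as nn

def length_regulate(phoneme_enc, durations):
    """Expand (N, d) phoneme encodings to (sum(durations), d) frames."""
    return torch.repeat_interleave(phoneme_enc, durations, dim=0)

class FeedForwardDecoder(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=1024, batch_first=True,
        )
        self.refine = nn.TransformerEncoder(layer, n_layers)
        self.to_acoustic = nn.Linear(d_model, 80)        # e.g., mel frames

    def forward(self, phoneme_enc, durations):
        frames = length_regulate(phoneme_enc, durations)  # coarse initial alignment
        frames = self.refine(frames.unsqueeze(0))         # self-attention refinement
        return self.to_acoustic(frames)

# Usage with dummy data:
enc = torch.randn(5, 256)               # 5 phoneme encodings
dur = torch.tensor([3, 2, 4, 1, 5])     # frames per phoneme, from the duration model
mel = FeedForwardDecoder()(enc, dur)    # (1, 15, 80)
```

Because the whole sequence is produced in one forward pass rather than frame by frame, inference is fast and no teacher forcing is needed during training.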
“…Lyrics-to-singing alignment is important for SVS, as it decides how long each phoneme is sung in the synthesized voice. Previous works [21,42] usually rely on human labeling to split songs into sentences and then conduct phoneme alignment within each sentence using an HMM (hidden Markov model) based speech recognition model. In this paper, we propose a new alignment model that extracts the duration of each phoneme from raw lyrics and song recordings, without relying on any human labeling effort.…”
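For illustration, converting a frame-level phoneme alignment (such as the output of an HMM forced aligner) into the per-phoneme durations these alignment models produce amounts to a simple run-length encoding; the frame-label format here is an assumption:

```python
# Run-length encode frame-level phoneme labels into per-phoneme durations.
from itertools import groupby

def frames_to_durations(frame_phonemes):
    """frame_phonemes: one phoneme label per acoustic frame."""
    return [(ph, len(list(run))) for ph, run in groupby(frame_phonemes)]

print(frames_to_durations(["sil", "sil", "a", "a", "a", "t"]))
# -> [('sil', 2), ('a', 3), ('t', 1)]
```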
In this paper, we develop DeepSinger, a multilingual multi-singer singing voice synthesis (SVS) system built from scratch using singing training data mined from music websites. The DeepSinger pipeline consists of several steps: data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model that automatically extracts the duration of each phoneme in the lyrics, proceeding from coarse-grained sentence level to fine-grained phoneme level, and a multilingual multi-singer singing model based on a feed-forward Transformer that generates linear spectrograms directly from lyrics; voices are then synthesized with Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites; 2) the lyrics-to-singing alignment model avoids any human alignment labeling and greatly reduces labeling cost; 3) the singing model based on a feed-forward Transformer is simple and efficient, removing the complicated acoustic feature modeling of parametric synthesis and leveraging a reference encoder to capture a singer's timbre from noisy singing data; and 4) it can synthesize singing voices in multiple languages and for multiple singers. We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese, and English). The results demonstrate that, with singing data mined purely from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness.
CCS Concepts: • Computing methodologies → Natural language processing; • Applied computing → Sound and music computing.
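The final synthesis step, recovering a waveform from a predicted linear spectrogram via Griffin-Lim, can be sketched with librosa; the STFT parameters below are illustrative, not DeepSinger's actual settings:

```python
# Spectrogram inversion with Griffin-Lim; hop/window sizes are assumptions.
import numpy as np
import librosa

def spectrogram_to_audio(lin_spec, n_iter=60, hop_length=256, win_length=1024):
    # lin_spec: linear-frequency magnitude spectrogram,
    # shape (1 + n_fft // 2, n_frames)
    return librosa.griffinlim(
        lin_spec, n_iter=n_iter,
        hop_length=hop_length, win_length=win_length,
    )

# Example: round-trip a short sine wave through its STFT magnitude.
y = np.sin(2 * np.pi * 220 * np.arange(22050) / 22050).astype(np.float32)
mag = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
y_hat = spectrogram_to_audio(mag)
```

Griffin-Lim iteratively estimates the phase discarded in the magnitude spectrogram, which keeps the pipeline simple at some cost in audio quality compared to a neural vocoder.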