A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs

Blaauw, Merlijn; Bonada, Jordi

doi:10.3390/app7121313

Cited by 92 publications

(135 citation statements)

References 18 publications

Supporting

Mentioning

133

Contrasting

“…Our proposed system uses 64-dimensional input features similar to [17], extracted with a 10 ms hop time. A reduction factor, r = 2, is used.…”

Section: Methods

mentioning

confidence: 99%
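
The figures in the statement above (64-dimensional features, 10 ms hop, r = 2) imply some simple frame arithmetic, sketched below. The sketch assumes the reduction factor is meant in the usual sequence-to-sequence sense, where one decoder step covers r consecutive frames; the quoted text does not spell this out, and the function name `frame_counts` is hypothetical.

```python
# Values taken from the quoted citation statement.
FEATURE_DIM = 64   # dimensionality of each feature frame
HOP_MS = 10        # hop time between frames, in milliseconds
REDUCTION = 2      # reduction factor r

def frame_counts(duration_ms, hop_ms=HOP_MS, r=REDUCTION):
    """Return (n_frames, n_steps) for an excerpt of duration_ms milliseconds.

    Assumption: one decoder step covers r consecutive frames, so the
    number of steps is the frame count divided by r, rounded up.
    """
    n_frames = -(-duration_ms // hop_ms)   # integer ceiling division
    n_steps = -(-n_frames // r)
    return n_frames, n_steps

# A 3-second excerpt: 300 frames of 64 values each, 150 steps at r = 2.
n_frames, n_steps = frame_counts(3000)
print(n_frames, n_steps, n_frames * FEATURE_DIM)
```

Under that assumption, a 10 ms hop gives 100 frames per second, and r = 2 halves the sequence length the model has to process.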

Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer

Blaauw

Bonada

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

We propose a sequence-to-sequence singing synthesizer, which avoids the need for training data with pre-aligned phonetic and acoustic features. Rather than the more common approach of a content-based attention mechanism combined with an autoregressive decoder, we use a different mechanism suitable for feed-forward synthesis. Given that phonetic timings in singing are highly constrained by the musical score, we derive an approximate initial alignment with the help of a simple duration model. Then, using a decoder based on a feed-forward variant of the Transformer model, a series of self-attention and convolutional layers refines the result of the initial alignment to reach the target acoustic features. Advantages of this approach include faster inference and avoiding the exposure bias issues that affect autoregressive models trained by teacher forcing. We evaluate the effectiveness of this model compared to an autoregressive baseline, the importance of self-attention, and the importance of the accuracy of the duration model.
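
The "approximate initial alignment" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only assumes that a duration model yields one relative duration per phoneme and that the score fixes the total number of frames. The function name `initial_alignment` and the example values are hypothetical.

```python
def initial_alignment(phonemes, durations, total_frames):
    """Expand phonemes into a frame-level sequence of length total_frames.

    durations are relative (any positive scale); they are rescaled so that
    the expanded sequence exactly fills the span dictated by the score.
    """
    if not phonemes or len(phonemes) != len(durations):
        raise ValueError("need one duration per phoneme")
    # Cumulative durations, accumulated in order.
    cumulative = []
    acc = 0.0
    for d in durations:
        acc += d
        cumulative.append(acc)
    if acc <= 0:
        raise ValueError("durations must sum to a positive value")
    # Scale the cumulative boundaries to frame indices.
    bounds = [0] + [round(c / acc * total_frames) for c in cumulative]
    frames = []
    for ph, start, end in zip(phonemes, bounds, bounds[1:]):
        # A very short phoneme may receive zero frames after rounding.
        frames.extend([ph] * (end - start))
    return frames

# Three phonemes with relative durations 1:2:1 spread over 8 frames.
print(initial_alignment(["s", "a", "n"], [1, 2, 1], 8))
```

In the approach the abstract describes, a rough alignment of this kind is only a starting point; the self-attention and convolutional layers of the decoder then refine it toward the target acoustic features.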

“…Non-Seq2Seq singing synthesizers include those based on autoregressive architectures [17,21,22], feed-forward CNN [23], and feed-forward GAN-based approaches [24,25].…”

Section: Relation To Prior Work

mentioning

confidence: 99%

“…The system decomposes a speech signal into the fundamental frequency f0, harmonic spectral envelope and aperiodicity envelope. It has been proved that these parameters can be used to reconstruct a high quality synthesis of speech signals, even after dimensionality reduction techniques have been applied to the parameters [15].…”

Section: World Vocoder

mentioning

confidence: 99%

A Vocoder Based Method for Singing Voice Extraction

Chandna

Blaauw

Bonada

2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

This paper presents a novel method for extracting the vocal track from a musical mixture. The musical mixture consists of a singing voice and a backing track, which may comprise various instruments. We use a convolutional network with skip and residual connections as well as dilated convolutions to estimate vocoder parameters, given the spectrogram of an input mixture. The estimated parameters are then used to synthesize the vocal track, without any interference from the backing track. We evaluate our system through objective metrics pertinent to audio quality and interference from background sources, and via a comparative subjective evaluation. We use open-source source separation systems based on Non-negative Matrix Factorization (NMF) and Deep Learning methods as benchmarks for our system and discuss future applications for this particular algorithm.

“…It is applied to a specific piano, and the results outperform the earlier methods in note-level polyphonic piano music transcription. Blaauw and Bonada [7] describe a singing synthesizer based on deep neural networks called the Neural Parametric Singing Synthesizer (NPSS), which can generate high-quality singing when a musical score and lyrics are given as the input. The NPSS can learn the timbre and expressive features of a singer from a small set of recordings.…”

Section: Machine and Deep Learning

mentioning

confidence: 99%