Neural Vocoding for Singing and Speaking Voices with the Multi-Band Excited WaveNet

Röebel, Axel; Bous, Frederik

doi:10.3390/info13030103

Cited by 6 publications

(2 citation statements)

References 60 publications


“…To convert the mel-spectrograms to audio for the perceptual test, we use the neural vocoder of [34] which has been shown to work particularly well on singing voice. We use the same universal voice model (trained on speech and singing voice) for synthesis of both speech and singing voice.…”

Section: Audio Synthesis (mentioning)

confidence: 99%

A Bottleneck Auto-Encoder for F0 Transformations on Speech and Singing Voice

Bous

Röebel

2022

Information

Self Cite


In this publication, we present a deep learning-based method to transform the f0 in speech and singing voice recordings. f0 transformation is performed by training an auto-encoder on the voice signal’s mel-spectrogram and conditioning the auto-encoder on the f0. Inspired by AutoVC/F0, we apply an information bottleneck to it to disentangle the f0 from its latent code. The resulting model successfully applies the desired f0 to the input mel-spectrograms and adapts the speaker identity when necessary, e.g., if the requested f0 falls out of the range of the source speaker/singer. Using the mean f0 error in the transformed mel-spectrograms, we define a disentanglement measure and perform a study over the required bottleneck size. The study reveals that to remove the f0 from the auto-encoder’s latent code, the bottleneck size should be smaller than four for singing and smaller than nine for speech. Through a perceptive test, we compare the audio quality of the proposed auto-encoder to f0 transformations obtained with a classical vocoder. The perceptive test confirms that the audio quality is better for the auto-encoder than for the classical vocoder. Finally, a visual analysis of the latent code for the two-dimensional case is carried out. We observe that the auto-encoder encodes phonemes as repeated discontinuous temporal gestures within the latent code.


“…If the disentanglement has been successful, the decoder will use the new intensity contour to synthesise a mel-spectrogram with the original properties but with the desired intensity. The mel-spectrograms are inverted with the mel-inverter from [17].…”

Section: Proposed Intensity Transformations (mentioning)

confidence: 99%

Analysis and transformations of intensity in singing voice

Bous

Röebel

2022

Preprint

Self Cite


In this paper, we introduce a neural auto-encoder that transforms the voice intensity in recordings of singing voice. Since most recordings of singing voice are not annotated with voice intensity, we propose a means to estimate the relative voice intensity from the signal's timbre using a neural intensity estimator. Two methods to overcome the unknown recording factor that relates voice intensity to recorded signal power are given: the unknown recording factor can either be learned alongside the weights of the intensity estimator, or a special loss function based on the scalar product can be used to match only the intensity contour of the recorded signal's power. The intensity models are used to condition a previously introduced bottleneck auto-encoder that disentangles its input, the mel-spectrogram, from the intensity. We evaluate the intensity models by their consistency and by their fitness to provide useful information to the auto-encoder. A perceptive test is carried out that evaluates the perceived intensity change in transformed recordings and the synthesis quality. The perceptive test confirms that changing the conditional input changes the perceived intensity accordingly, thus suggesting that the proposed intensity models encode information about the voice intensity.


Advancing Naturalistic Affective Science with Deep Learning

Lin

Bulls

Tepfer

et al. 2023

Affec Sci


People express their own emotions and perceive others' emotions via a variety of channels, including facial movements, body gestures, vocal prosody, and language. Studying these channels of affective behavior offers insight into both the experience and perception of emotion. Prior research has predominantly focused on studying individual channels of affective behavior in isolation using tightly controlled, non-naturalistic experiments. This approach limits our understanding of emotion in more naturalistic contexts where different channels of information tend to interact. Traditional methods struggle to address this limitation: manually annotating behavior is time-consuming, making it infeasible to do at large scale; manually selecting and manipulating stimuli based on hypotheses may neglect unanticipated features, potentially generating biased conclusions; and common linear modeling approaches cannot fully capture the complex, nonlinear, and interactive nature of real-life affective processes. In this methodology

